ArangoDB Collection Disk Usage Analysis | ArangoDB 2012

In this post we’ll explain how ArangoDB stores collection data on disk and look at its storage space requirements, compared to other popular NoSQL databases such as CouchDB and MongoDB.

How ArangoDB allocates disk space

ArangoDB stores documents in collections. The collection data is persisted on disk so it does not get lost in case of a server restart.

When a collection gets created (either explicitly or by inserting the first document into it), a separate directory is created for the collection on disk. ArangoDB will also create a so-called “journal” file for the collection that the document data will be written into. The “journal” file will have a file size of 32 MB by default.

The value of 32 MB is configurable at server start by setting the “--database.maximal-journal-size” option. It will be used for all subsequently created journal files. Existing journal files will not be changed, though. The journal size is also configurable on a per-collection basis by setting the “journalSize” option when creating the collection. The minimum value is 1 MB.
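
As a rough illustration, a per-collection journal size could be prepared like this. The sketch only builds a JSON request body and does not talk to a server. The helper name `collection_create_payload` is hypothetical, and it is an assumption that the value is passed in bytes and sent to the server's collection-creation HTTP endpoint. Only the “journalSize” option name and the 1 MB minimum are taken from the text above.

```python
import json

MIN_JOURNAL_SIZE = 1 * 1024 * 1024  # 1 MB minimum, as stated above


def collection_create_payload(name, journal_size_mb):
    """Build a JSON body that requests a collection with a custom journal size.

    Hypothetical helper: the size is converted to bytes (an assumption).
    """
    journal_size = int(journal_size_mb * 1024 * 1024)
    if journal_size < MIN_JOURNAL_SIZE:
        raise ValueError("journal size must be at least 1 MB")
    return json.dumps({"name": name, "journalSize": journal_size})


# Body for a small collection with a 4 MB journal instead of the 32 MB default.
print(collection_create_payload("small_collection", 4))
```

Validating the minimum on the client side is a design choice of this sketch; the server applies its own checks.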

When the journal file is first created, it is prefilled with zero-bytes so it will already take up as much disk space as the journal size value was set to. That means even if a collection contains only one small document, the journal file will already take up the 32 MB on disk. Adding more documents will then fill up blocks in the existing journal file, not taking additional disk space.

A new journal file is created only when the current journal file is full. The previous journal file then becomes a “datafile”. The distinction between journal files and datafiles is that journal files are actively written to whereas datafiles are immutable. It should be noted that after the initial allocation with zero-bytes, journal files are written in append-only fashion. There are no in-place modifications of document data in journals or datafiles.

As mentioned before, ArangoDB claims disk space for collection data in chunks that have a certain size. Disk usage is not increasing for each inserted document, but only when a journal file gets full and rotated. Obviously, the journal size is a factor for disk space consumption so it is a configurable value in ArangoDB.

ArangoDB will also allocate 2 MB of storage space per collection to store document structure and data type information. These so-called “shapes” are an important design aspect of ArangoDB that is explained below. Furthermore, ArangoDB will create an initial compaction file per collection, with the same size as the journal file.
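
The allocation rules described so far can be put into a small back-of-the-envelope model. This is a simplified sketch, not ArangoDB's actual accounting: the function name `estimate_collection_disk_mb` is made up for illustration, `data_mb` stands for the amount of data as written to the journals (not the raw JSON size), and file headers and filesystem overhead are ignored.

```python
import math

SHAPE_FILE_MB = 2  # per-collection shape storage, as described above


def estimate_collection_disk_mb(data_mb, journal_mb=32):
    """Estimate the on-disk size of one collection in MB (simplified model)."""
    # One journal exists as soon as the collection is created; another one
    # is only added when the current journal is full and gets rotated.
    journal_files = max(1, math.ceil(data_mb / journal_mb))
    # Initial compaction file, same size as a journal file.
    compaction_file_mb = journal_mb
    return SHAPE_FILE_MB + compaction_file_mb + journal_files * journal_mb


# A tiny collection with the default journal size and with a 4 MB journal.
print(estimate_collection_disk_mb(0.3))
print(estimate_collection_disk_mb(0.3, journal_mb=4))
```

Under this model a nearly empty collection takes 66 MB with the default journal size and 10 MB with a 4 MB journal, because usage grows in journal-sized steps rather than per document.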

Some storage design considerations

Preallocation/prefilling

The preallocation/prefilling that ArangoDB employs might not make sense at first, but it is done for a reason. In some environments, overwriting an existing (prefilled) file is faster than appending new blocks to an existing file. Furthermore, allocating storage in bigger chunks might help reduce file system fragmentation.
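
To make the idea concrete, here is a minimal sketch of prefilling a file with zero-bytes and then writing into the already allocated space. It is not ArangoDB's implementation; the file name, the 1 MB size and the helper functions are chosen purely for the demonstration.

```python
import os
import tempfile

JOURNAL_SIZE = 1024 * 1024  # 1 MB here to keep the demo small


def create_prefilled_file(path, size):
    """Create a file of `size` bytes filled with zero-bytes."""
    with open(path, "wb") as f:
        f.write(b"\0" * size)


def write_at(path, offset, payload):
    """Write into the prefilled region; the file itself does not grow."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)
    return offset + len(payload)  # next free position in the file


with tempfile.TemporaryDirectory() as tmp:
    journal = os.path.join(tmp, "journal.db")
    create_prefilled_file(journal, JOURNAL_SIZE)
    size_before = os.path.getsize(journal)
    next_offset = write_at(journal, 0, b'{"name": "test"}')
    size_after = os.path.getsize(journal)
    print(size_before, size_after, next_offset)
```

The file size is the same before and after the write, which is exactly why disk usage does not change when documents are added to a journal that still has free space.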

Though preallocation/prefilling can improve performance considerably, it comes at a cost: first of all, disk usage is not directly proportional to the number/size of documents inserted. More importantly, the storage overhead may be relatively high for small datasets.

In ArangoDB, the overhead of preallocation/prefilling is configurable by setting the journal size appropriately. The value can be adjusted at collection level. The effect of different journal sizes was measured in this test and can be found below in the columns “ArangoDB, 32 MB journal” and “ArangoDB, 4 MB journal”.

CouchDB does not seem to preallocate space and therefore it does not have much initial overhead for small datasets. Storage space can be saved by compressing the data. Using Snappy compression in CouchDB reduced disk usage by about 20 to 60 %, depending on the dataset used. The compressed data sizes are available in the results below in column “CouchDB, snappy” (with compression) and “CouchDB” (no compression).

In MongoDB, the disk space allocation is done in blocks of increasing size (64 MB, 128 MB, 256 MB, 512 MB, 1 GB, 2 GB). The first block consumes 64 MB already. There is also a startup option --smallfiles that modifies this series to [16 MB … 512 MB] to reduce space overhead. This is turned off by default and was not covered in these tests. MongoDB by default will also preallocate the next block before the current block is filled up. This has the advantage that the next block is likely to be already available when needed, but makes it consume even more disk space. This preallocation can also be turned off by starting with option --noprealloc. The results for this are present in the column “MongoDB, no prealloc”. As in ArangoDB, disk usage in MongoDB is not directly proportional to the number/size of documents inserted.
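
The block series can be turned into a small calculation. The following is only a sketch under stated assumptions: the function `mongo_allocated_mb` is hypothetical, the number of blocks in use is passed in directly rather than derived from the data size, and the additional 16 MB namespace file is an assumption that is not mentioned in this post. It is also assumed that blocks stay at 2 GB once the end of the series is reached.

```python
BLOCK_SERIES_MB = [64, 128, 256, 512, 1024, 2048]  # block sizes listed above
NAMESPACE_FILE_MB = 16  # assumption: per-database namespace file, not from the post


def mongo_allocated_mb(blocks_in_use, prealloc=True):
    """Return the total allocated MB when `blocks_in_use` blocks hold data."""
    # With preallocation, one extra block is allocated ahead of time.
    allocated_blocks = blocks_in_use + (1 if prealloc else 0)
    # Assumption: after the largest size is reached, blocks stay at that size.
    sizes = [BLOCK_SERIES_MB[min(i, len(BLOCK_SERIES_MB) - 1)]
             for i in range(allocated_blocks)]
    return NAMESPACE_FILE_MB + sum(sizes)


for blocks in (1, 2, 4):
    print(blocks, mongo_allocated_mb(blocks), mongo_allocated_mb(blocks, prealloc=False))
```

With one block in use this gives 208 MB with preallocation and 80 MB without, which lines up with the smallest MongoDB values in the results below.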

Data formats

While preallocation space overhead matters for small datasets, its effect becomes less important for bigger datasets. For bigger datasets, it’s especially important how efficient the storage format of a database is and if patterns in the data can be exploited to compress it.

Storing the incoming JSON data as plain text would be too inefficient so ArangoDB stores data in some binary format, and so do CouchDB and MongoDB.

ArangoDB separates the document structure and the actual document data when saving a document. Document structure information, consisting of attribute names and attribute data types, is stored as so-called “shapes”. The document data stored will only contain a shape-id (a reference to an existing shape), and multiple documents can point to the same shapes. This helps in reducing disk usage when many or even all documents in a collection have the same structure.
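
The principle of sharing structure between documents can be sketched in a few lines. This is a toy model and not ArangoDB's shape format: the class `ShapedCollection` and its in-memory layout are invented for illustration, and nested documents are not handled.

```python
class ShapedCollection:
    """Toy store that keeps document structure separate from document data."""

    def __init__(self):
        self.shapes = {}     # shape (tuple of (name, type name)) -> shape id
        self.documents = []  # list of (shape id, tuple of values)

    def insert(self, doc):
        items = sorted(doc.items(), key=lambda kv: kv[0])
        # The shape consists of attribute names and attribute data types only.
        shape = tuple((name, type(value).__name__) for name, value in items)
        # Reuse the id of an identical shape, or register a new one.
        shape_id = self.shapes.setdefault(shape, len(self.shapes))
        # The stored document holds just the shape id and the raw values.
        self.documents.append((shape_id, tuple(value for _, value in items)))
        return shape_id


coll = ShapedCollection()
id_a = coll.insert({"name": "Alice", "zip": 12345})
id_b = coll.insert({"name": "Bob", "zip": 54321})
id_c = coll.insert({"name": "Carol", "state": "NY"})
print(id_a, id_b, id_c, len(coll.shapes))
```

The first two documents share one shape, so their attribute names are stored only once; the third document has a different structure and gets a shape of its own.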

CouchDB supports compression out of the box, but this comes at some performance cost (CPU cycles will be spent on compressing and uncompressing data) so it is turned off by default. Apart from that, it seems that CouchDB stores all attribute names individually for each document inserted, even if all documents of a collection/database share identical attribute names.

MongoDB uses an optimised binary data representation (BSON) as the internal storage format, and also seems to store repeated document structure information redundantly.

Actual storage space requirements

We have measured the actual disk usage in ArangoDB for some real-world and artificial datasets. For ArangoDB, we have used journal sizes of 32 MB (the default value) and 4 MB to illustrate the difference. Furthermore, we have imported the same datasets into other document databases, CouchDB and MongoDB, to see how much disk space they require. We’ve used CouchDB 1.2 without file compression and with Snappy compression. We’ve tested MongoDB 2.1.3 with and without preallocation.

Test datasets

The following datasets have been tested:

| Dataset name | Description | Number of documents | Average document size (bytes) |
|---|---|---|---|
| names1000 | person records containing names and addresses, artificially created with source data from US census bureau, ZIP code and state lists | 1,000 | 331 |
| names10000 | same | 10,000 | 331 |
| names100000 | same | 100,000 | 331 |
| enron email corpus | e-mail data, published by Federal Energy Regulatory Commission | 41,299 | 3,895 |
| wiki50 | Wikipedia articles | 50 | 23,905 |
| wiki500 | same | 500 | 22,935 |
| wiki5000 | same | 5,000 | 23,108 |
| wiki50000 | same | 50,000 | 11,770 |
| access logs | Apache web server access logs | 1,357,246 | 258 |
| aol search queries | search engine queries, published by AOL | 3,459,421 | 111 |

All datasets were JSONified and imported into the aforementioned databases using arangoimp (ArangoDB), the _bulk_docs API (CouchDB), and mongoimport (MongoDB). For each dataset, the total actual disk allocation in bytes as reported by the filesystem was measured.

Test results

Storage space requirements on an ext4 filesystem were as follows. All values are reported in MB (1,048,576 bytes).

| Dataset name | ArangoDB, 32 MB journal | ArangoDB, 4 MB journal | CouchDB | CouchDB, snappy | MongoDB | MongoDB, no prealloc |
|---|---|---|---|---|---|---|
| names1000 | 66.36 * | 10.36 * | 0.66 | 0.47 | 208.01 | 80.01 |
| names10000 | 66.36 | 10.36 | 6.61 | 4.68 | 208.01 | 80.01 |
| names100000 | 98.36 | 46.36 | 66.13 | 46.74 | 208.01 | 80.01 |
| enron email corpus | 203.75 | 171.38 | 163.04 | 95.22 | 464.01 | 208.01 |
| wiki50 | 66.36 * | 10.36 * | 1.13 | 0.63 | 208.01 | 80.01 |
| wiki500 | 66.36 | 18.23 | 10.93 | 6.55 | 208.01 | 80.01 |
| wiki5000 | 162.30 | 117.45 | 110.09 | 65.27 | 464.01 | 208.01 |
| wiki50000 | 609.88 | 563.38 | 567.68 | 343.95 | 2000.01 | 976.01 |
| access logs | 450.36 | 406.35 | 677.79 | 521.32 | 2000.01 | 976.01 |
| aol search queries | 578.36 | 542.36 | 2067.70 | 843.86 | 2000.01 | 976.01 |

* If you wonder why the actual values differ from the expected 32 MB and 4 MB: ArangoDB creates a journal file and a compaction file that together take up twice the specified journal size, plus a 2 MB shape file.

Final notes

If you intend to store lots of documents with identical structures, ArangoDB might be for you. You might well save some storage space. This is especially true if you plan on using long attribute names. ArangoDB stores attribute names in the shape data, so attribute names are not stored repeatedly for documents that use the same shapes.

Setting the journal size to a lower value than the default 32 MB might be sensible if you plan to have a lot of small (in terms of number of documents) collections in ArangoDB. If you create such collections often, the reduced journal size can save considerable disk space.

This should not be done for collections that you plan to insert documents into frequently. Reducing journal size for such collections might still reduce disk space usage but may also decrease write performance because more files need to be created and synced. So this is a trade-off.

Jan Steemann

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

2 Comments

  1. aromatix on July 27, 2012 at 8:20 am

    I think a better comparison should take reads, inserts and updates into account, and would be more informative with information about the processor architecture (32/64 bit).

    The test seems to have been done on a single server node (what about comparing sharding/clustering?) to compare relevant uses.

    Anyway, this is very hopeful, and will be even more so if the stats go the same way for updates (important in the real world).

    • jan steemann on July 27, 2012 at 5:49 pm

      @aromatix: The architecture was 64 bits for all tests.
      And the tests were done on a single server node. The reason for this is that ArangoDB currently does not support sharding/clustering. This is planned for a future release.

      We’d like to conduct some more benchmarks (including updates) when time allows. If we do, we’ll of course post the results here.
