Bulk Inserts Comparison: MongoDB, CouchDB, ArangoDB ’12
In the last couple of posts, we looked at ArangoDB’s performance for individual document insert, delete, and update operations. This time we’ll be looking at batched inserts. For reference, we’ll compare ArangoDB’s results with what can be achieved with CouchDB and MongoDB.
Test setup
We have used the bulk insert benchmark tool to generate results for MongoDB, CouchDB, and ArangoDB. The benchmark tool uses the HTTP bulk document APIs of CouchDB and ArangoDB, and the binary protocol for MongoDB (as MongoDB does not offer an HTTP bulk API). The benchmark tool was run on the same machine as the database servers, so network latency can be ruled out as an influencing factor. The test machine specifications are:
- Linux Kernel 2.6.37.6-0.11, cfq scheduler
- 64 bit OS
- 8x Intel(R) Core(TM) i7 CPU, 2.67 GHz
- 12 GB total RAM
- SATA II hard drive (7,200 RPM, 32 MB cache)
The total “net insert time” (time spent in the benchmark tool sending the request to the database and waiting for the database’s response, i.e. excluding the time needed to generate the document data) is reported for several datasets in the following charts.
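To make the “net insert time” measurement concrete, here is a minimal Python sketch of a comparable measurement against the two HTTP bulk APIs. This is not the actual benchmark tool: the server URLs, the database/collection names (testdb, testcol), the use of the requests library, and an unauthenticated local server are all assumptions for illustration.

```python
import json
import time

import requests  # assumed HTTP client, not part of the benchmark tool

# Hypothetical endpoints; adjust host, database, and collection names.
COUCHDB_BULK_URL = "http://localhost:5984/testdb/_bulk_docs"
ARANGODB_IMPORT_URL = (
    "http://localhost:8529/_api/import?type=documents&collection=testcol"
)

# Mimics the uniform_* datasets: one attribute per document.
docs = [{"value": i} for i in range(1000)]


def couchdb_net_insert_time(docs):
    # CouchDB's bulk documents API expects a JSON body of the form {"docs": [...]}.
    payload = {"docs": docs}
    start = time.perf_counter()  # timer covers request + server + response only
    response = requests.post(COUCHDB_BULK_URL, json=payload)
    elapsed = time.perf_counter() - start
    response.raise_for_status()
    return elapsed


def arangodb_net_insert_time(docs):
    # ArangoDB's import API with type=documents takes newline-delimited JSON.
    payload = "\n".join(json.dumps(doc) for doc in docs)
    start = time.perf_counter()
    response = requests.post(ARANGODB_IMPORT_URL, data=payload)
    elapsed = time.perf_counter() - start
    response.raise_for_status()
    return elapsed


print("CouchDB:", couchdb_net_insert_time(docs), "s")
print("ArangoDB:", arangodb_net_insert_time(docs), "s")
```

The document data is generated before the timer starts, matching the “net insert time” definition above.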
The database versions used for tests were:
- MongoDB 2.1.3, with preallocation
- CouchDB 1.2, with delayed_commits, without compression
- ArangoDB 1.1-alpha, with waitForSync=false
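For the MongoDB side there is no HTTP bulk API, so batched inserts go through a driver speaking the binary wire protocol. Below is a rough sketch assuming a current PyMongo (insert_many; the 2012-era driver used a different call) and hypothetical database/collection names. It also shows how ArangoDB’s waitForSync setting can be applied per collection at creation time via the HTTP API.

```python
import requests
from pymongo import MongoClient  # assumption: a current PyMongo driver

# MongoDB: a single batched insert over the binary wire protocol.
client = MongoClient("localhost", 27017)
collection = client["testdb"]["testcol"]  # hypothetical names
docs = [{"value": i} for i in range(1000)]
collection.insert_many(docs)

# ArangoDB: waitForSync=false means the server does not fsync to disk
# before acknowledging writes; it is a per-collection property.
requests.post(
    "http://localhost:8529/_api/collection",
    json={"name": "testcol", "waitForSync": False},
)
```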
The datasets tested can be categorised into three groups: small, medium, and big. The small datasets tested were:
| Dataset name | Description | Number of documents |
| --- | --- | --- |
| uniform_1000 | one attribute plus a unique "_id" value | 1,000 |
| uniform_10000 | same, but 10,000 documents | 10,000 |
| names_10000 | person records containing names and addresses, artificially created with source data from the US Census Bureau, ZIP code and state lists | 10,000 |
The medium datasets tested were:
| Dataset name | Description | Number of documents |
| --- | --- | --- |
| enron | Enron e-mail corpus, published by the Federal Energy Regulatory Commission | 41,299 |
| names_100000 | person records containing names and addresses, artificially created with source data from the US Census Bureau, ZIP code and state lists | 100,000 |
| names_300000 | same, but 300,000 documents | 300,000 |
| wiki_50000 | Wikipedia articles | 50,000 |
The big datasets tested consisted of:
| Dataset name | Description | Number of documents |
| --- | --- | --- |
| uniform_1000000 | one attribute plus a unique "_id" value | 1,000,000 |
| uniform_10000000 | same, but 10,000,000 documents | 10,000,000 |
| aol | search engine queries published by AOL | 3,459,421 |
| accesslogs | Apache web server access logs | 1,357,246 |
Results, small datasets
For the smallest dataset (uniform_1000), the results were almost on par: CouchDB was slightly faster than MongoDB, which in turn was slightly faster than ArangoDB. For the other small datasets tested, MongoDB was slightly faster than ArangoDB, and both were notably faster than CouchDB.
Results, medium datasets
For the medium datasets, MongoDB was fastest for the first two sets tested, and ArangoDB was fastest for the other two. CouchDB was slightly slower for two of the datasets, and substantially slower for the other two.
Results, big datasets
With the bigger datasets tested, ArangoDB had the lowest bulk insert times. MongoDB was slightly slower in three of the cases tested, and substantially slower in the remaining case (uniform_10000000). CouchDB consistently had the highest insert times.
Conclusion
With the datasets tested, ArangoDB was on par with MongoDB (MongoDB being slightly faster in some cases and ArangoDB in others). CouchDB was notably slower than both MongoDB and ArangoDB, except in one case.
Caveats
These are benchmarks for specific datasets. The dataset volumes and types might or might not be realistic, depending on what you plan to do with a database. Results might look completely different for other datasets.
In addition, the benchmarks compare the HTTP APIs of CouchDB and ArangoDB against the binary protocol of MongoDB, which gives MongoDB a slight efficiency advantage. However, real-world applications will also use MongoDB’s binary protocol, so this is an advantage MongoDB has in real life as well (though it comes with the disadvantage that the protocol is not human-readable).
Furthermore, there are of course other aspects that would deserve observation, e.g. datafile size and memory usage. These aspects haven’t been looked at in this post.
So please be sure to run your own tests in your own environment before adopting the results.