String Comparison Performance: ArangoDB Query Optimization
We’ve been using Callgrind with its powerful frontend KCachegrind for quiet some time to analyse where the hot spots can be found inside of ArangoDB. One thing always accounting for a huge chunk of the resource usage was string comparison. Yes, string comparison isn’t as cheap as one may think, but its been even a bit more than one would expect. And since much of the business of a database is string comparison, its used a lot.
ArangoDB and V8 use the ICU Library for these purposes (with no alternatives on the market) – so basically we heavily rely on the performance of the ICU library. However, one line in the ICU change-log – ‘Performance: string comparisons significantly faster’ – made us listen up.
So it was a crystal clear objective to take advantage of these performance improvements. As we use the ICU bundled with V8, we had to make sure it would work smooth for it first ;-). After enrolling the upgrade, we wanted to know whether everything was working fine with valgrind etc, and get some figures how much the actual improvement is.
Therefore we set up some test-cases that perform a lot of string comparisons.
Sorting 6 million strings
Sorting strings involves comparing them. Depending on the sort algorithm and the number of items – the amount of comparisons varies. We create a collection with no index on our string key, so the sorting has to be done in memory after fetching the items:
arangosh> var time = require("internal").time;
arangosh> db._create("testCollection");
arangosh> for (var i = 0; i < 1000000; ++i) {
........> db.testCollection.save({ value: "test" + i});
........> }
arangosh> var time = require("internal").time;
arangosh> var before = time (); db._query("FOR doc IN testCollection SORT doc.value RETURN doc.value"); time() - before
- ICU 52.1: 19s
- ICU 54.1: 7.5s
So for this case, we only use 40% of the resources after the upgrade.
Improving indexes
Skiplist indices also lean on comparing the items they access; The resources used for each execution of the above query are now moved to the time the collection is loaded. We reuse the above collection and add an index:
arangosh> var time = require("internal").time;
arangosh> db._create("testCollection");
arangosh> for (var i = 0; i < 1000000; ++i) {
........> db.testCollection.save({ value: "test" + i});
........> }
arangosh> require("internal").wal.flush(true, true);
arangosh> db.testCollection.ensureSkiplist("value");
After generating the index, we ensure everything is written to the disk, unload the collection, and load it again to watch how long it takes:
arangosh> db.testCollection.unload();
arangosh> db.testCollection
[ArangoCollection 42438187, "testCollection" (type document, status unloading)]
We need to wait until the status changes from unloading to unloaded. If it doesn’t work out after a minute or so, restarting arangod may force the situation to be clear. Now we can do the real test:
arangosh> var before = time (); db.testCollection.load(); time() - before
- ICU 52.1: 23s
- ICU 54.1: 8s
So here we roughly use 35% of the resources with the new version.
Conclusion
You may expect huge performance improvements in ArangoDB 2.6!
Get the latest tutorials, blog posts and news: