String Comparison Performance: ArangoDB Query Optimization

May 5 2015,/Performance

We’ve been using Callgrind with its powerful frontend KCachegrind for quiet some time to analyse where the hot spots can be found inside of ArangoDB. One thing always accounting for a huge chunk of the resource usage was string comparison. Yes, string comparison isn’t as cheap as one may think, but its been even a bit more than one would expect. And since much of the business of a database is string comparison, its used a lot.

ArangoDB and V8 use the ICU Library for these purposes (with no alternatives on the market) – so basically we heavily rely on the performance of the ICU library. However, one line in the ICU change-log – ‘Performance: string comparisons significantly faster’ – made us listen up.

So it was a crystal clear objective to take advantage of these performance improvements. As we use the ICU bundled with V8, we had to make sure it would work smooth for it first ;-). After enrolling the upgrade, we wanted to know whether everything was working fine with valgrind etc, and get some figures how much the actual improvement is.

Therefore we set up some test-cases that perform a lot of string comparisons.

Sorting 6 million strings

Sorting strings involves comparing them. Depending on the sort algorithm and the number of items – the amount of comparisons varies. We create a collection with no index on our string key, so the sorting has to be done in memory after fetching the items:

arangosh> var time = require("internal").time;
arangosh> db._create("testCollection");
arangosh> for (var i = 0; i < 1000000; ++i) {
........>  db.testCollection.save({ value: "test" + i}); 
........> }
arangosh> var time = require("internal").time;
arangosh> var before = time (); db._query("FOR doc IN testCollection SORT doc.value RETURN doc.value"); time() - before

ICU 52.1: 19s
ICU 54.1: 7.5s

So for this case, we only use 40% of the resources after the upgrade.

Improving indexes

Skiplist indices also lean on comparing the items they access; The resources used for each execution of the above query are now moved to the time the collection is loaded. We reuse the above collection and add an index:

arangosh> var time = require("internal").time;
arangosh> db._create("testCollection");
arangosh> for (var i = 0; i < 1000000; ++i) {
........>  db.testCollection.save({ value: "test" + i}); 
........> }
arangosh> require("internal").wal.flush(true, true);
arangosh> db.testCollection.ensureSkiplist("value");

After generating the index, we ensure everything is written to the disk, unload the collection, and load it again to watch how long it takes:

arangosh> db.testCollection.unload();
arangosh> db.testCollection
[ArangoCollection 42438187, "testCollection" (type document, status unloading)]

We need to wait until the status changes from unloading to unloaded. If it doesn’t work out after a minute or so, restarting arangod may force the situation to be clear. Now we can do the real test:

arangosh> var before = time (); db.testCollection.load(); time() - before

ICU 52.1: 23s
ICU 54.1: 8s

So here we roughly use 35% of the resources with the new version.

Conclusion

You may expect huge performance improvements in ArangoDB 2.6!

Willi Goesgens

Willi is part of ArangoDB Core team. He primarily works on performance tuning, the release process of ArangoDB and helping the community with his broad experience.

May 5 2015,Willi Goesgens

Fireside Chat – Powering GenAI: The Critical Foundations for Scale. Watch Now

String Comparison Performance: ArangoDB Query Optimization

Sorting 6 million strings

Improving indexes

Conclusion

Willi Goesgens

Leave a Comment Cancel Reply

Tags

Quick Links

Info

About Us

Stay In Touch