Data retrieval performance optimizations in ArangoDB 3.3.9
Our recent release 3.3.9 includes several performance optimizations for data retrieval cases. Benefits can be expected for both storage engines, MMFiles and RocksDB, AQL batch lookup queries, and cluster AQL queries.
MMFiles index batch lookups
For the MMFiles engine, an optimization has been made for retrieving multiple documents from an index (hash index, skiplist index or persistent index) in a batch.
Previous versions of MMFiles fetched documents from these indexes one after the other, which had some considerable overhead compared to fetching many documents from inside the index code in a single batch.
This optimization will speed up index-based lookup queries such as follows, provided that the attribute value
is indexed via one of the mentioned index types, and a considerable amount of documents satisfies the query’s filter condition:
FOR doc IN collection
FILTER doc.value == @something
RETURN doc.value
To measure the effect of this optimization, the above query was run on a collection with one million documents and varying cardinality for the indexed value
attribute.
In 3.3.8, the average execution times (arithmetic mean of 50 runs) of this query are:
- hash index on “value”, index cardinality 20: 0.0319390 s
- hash index on “value”, index cardinality 1000: 0.000577 s
- skiplist index on “value”, index cardinality 20: 0.031699 s
- skiplist index on “value”, index cardinality 1000: 0.000561 s
With the optimizations in 3.3.9 the average execution times of the same query go down to:
- hash index on “value”, index cardinality 20: 0.0229566 s (28% reduction)
- hash index on “value”, index cardinality 1000: 0.000464 s (20% reduction)
- skiplist index on “value”, index cardinality 20: 0.023426 s (26% reduction)
- skiplist index on “value”, index cardinality 1000: 0.000481 s (14% reduction)
RocksDB optimizations
For the RocksDB engine, we added an index-only optimization for AQL queries that can retrieve all required document attributes directly from an index.
This optimization will be triggered for the RocksDB engine if an index is used that covers all required attributes of the document that are used later on in the query. If applied, it will save retrieving the actual document data (which would require an extra lookup in RocksDB), but will instead build the document data solely from the index values found. It will only be applied when using up to 5 attributes from the document, and only if the rest of the document data is not used later on in the query.
The optimization is currently available for the RocksDB engine for the index types
- primary
- edge
- hash
- skiplist
- persistent
It also has a great effect on the performance of queries such as the one from above:
FOR doc IN collection
FILTER doc.value == @something
RETURN doc.value
In 3.3.8, the average execution times (arithmetic mean of 50 runs) of this query are:
- hash index on “value”, index cardinality 20: 0.254912 s
- hash index on “value”, index cardinality 1000: 0.004033 s
- skiplist index on “value”, index cardinality 20: 0.241435 s
- skiplist index on “value”, index cardinality 1000: 0.004102 s
With the optimizations in 3.3.9, the execution times go down to just:
- hash index on “value”, index cardinality 20: 0.0388957 s (85% reduction)
- hash index on “value”, index cardinality 1000: 0.000878 s (78% reduction)
- skiplist index on “value”, index cardinality 20: 0.037555 s (84% reduction)
- skiplist index on “value”, index cardinality 1000: 0.000885 s (78% reduction)
There are also other types of queries that will benefit greatly from this optimization, for example:
FOR doc IN collection
COLLECT value = doc.value WITH COUNT INTO count
RETURN { value, count }
Depending on the type of index present on attribute value
, the execution time of this query goes down between 50% and 80%.
Also, consider this simple join query:
FOR doc1 IN collection1
FILTER doc1.value == @value
FOR doc2 IN collection2
FILTER doc1.value == doc2.value
RETURN { value1: doc1.value, value2: doc2.value }
The execution time for this query also went down by about 55%.
If this particular optimization is applied, it will show up as “index only” in an AQL query’s execution plan for an IndexNode.
Even if no indexes can be used by a query, some lookups in the RocksDB engine may get faster. Another optimization in 3.3.9 extends the already existing “reduce-extraction-to-projection” AQL optimizer rule for RocksDB to provide projections of up to 5 document attributes (the previous implementation only supported a projection for a single document attribute).
The optimization is beneficial especially for huge documents because it will copy out only the projected attributes from the document instead of copying the entire document data from the RocksDB storage engine.
When applied, the AQL explainer will show the projected attributes in a “projections” remark for an EnumerateCollectionNode or IndexNode.
Scan-only optimization
Another optimization was added for AQL queries that iterate over collections or indexes but that do not need to return the actual document values.
Not fetching the actual document values from the storage engine will provide a considerable speedup when using the RocksDB engine, but may also help a bit in case of the MMFiles engine. The optimization will only be applied when full-scanning or index-scanning a collection without referring to any of its documents later on, and, for an IndexNode, if all filter conditions for the documents of the collection are covered by the index.
If the optimization is applied, it will show up as “scan only” in an AQL query’s execution plan for an EnumerateCollectionNode or an IndexNode.
Cluster AQL query optimizations
Yet another optimization in 3.3.9 extends the existing “collect-in-cluster” optimizer rule to push grouping, counting and deduplication onto the DB servers in many cases, so that the coordinator will only need to sum up the potentially smaller
results from the individual shards.
The following types of COLLECT queries are covered now:
- RETURN DISTINCT expr
- COLLECT WITH COUNT INTO …
- COLLECT var1 = expr1, …, varn = exprn (WITH COUNT INTO …), without INTO or KEEP
- COLLECT var1 = expr1, …, varn = exprn AGGREGATE …, without INTO or KEEP, for
aggregate functions COUNT/LENGTH, SUM, MIN and MAX.
Furthermore, another optimization in 3.3.9 removes the need for post-sorting results from multiple shards of the same collection in case the documents are retrieved from an index, but the index is not used for sorting. Previous versions of ArangoDB always sorted the results from the shards so a total order was kept.
This optimization can speed up gathering data from multiple shards when using an index only for filtering because it allows removing the merge sort of the individual shards’ results.
And finally, we have found a way to reduce the number of intra-node calls for cluster AQL queries slightly, which will reduce the number of cluster-internal HTTP requests for every single AQL query but will have the greatest effect for very simple AQL queries.
You can download 3.3.9 here:
Get the latest tutorials, blog posts and news: