AQL Optimizer Improvements in ArangoDB 2.8
With the 2.8 beta phase coming to an end it’s time to shed some light on the improvements in the 2.8 AQL optimizer. This blog post summarizes a few of them, focusing on the query optimizer. There’ll be a follow-up post that will explain dedicated new AQL features soon.
Array indexes
2.8 allows creating hash and skiplist indexes on attributes which are arrays. Creating such index works similar to creating a non-array index, with the exception that the name of the array attribute needs to be followed by a [*]
in the index fields definition:
db._create("posts");
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ] });
Now if the tags
attribute of a document in the posts
collection is an array, each array member will be inserted into the index:
db.posts.insert({ tags: [ "arangodb", "database", "aql" ] });
db.posts.insert({ tags: [ "arangodb", "v8", "javascript" ] });
db.posts.insert({ tags: [ "javascript", "v8", "nodejs" ] });
The index on tags[*]
will now contain the values arangodb
, database
, aql
and nosql
for the first document, arangodb
, v8
and javascript
for the second, and javascript
, v8
and nodejs
for the third.
The following AQL will find any documents that have a value of javascript
contained in their tags
value:
FOR doc IN posts
FILTER 'javascript' IN doc.tags[*]
RETURN doc`
This will use the array index on tags[*]
.
The array index works by inserting all members from an array into the index separately. Duplicates are removed automatically when populating the index.
An array index can also be created on a sub-attribute of array members. For example, the following definition will make sure the name
sub-attributes of the tags
array values will make it into the index:
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*].name" ] });`
That will allow storing data as follows:
db.posts.insert({ tags: [ { name: "arangodb" }, { name: "database" }, { name: "aql" } ] });
db.posts.insert({ tags: [ { name: "arangodb" }, { name: "v8" }, { name: "javascript" } ] });
db.posts.insert({ tags: [ { name: "javascript" }, { name: "v8" }, { name: "nodejs" } ] });`
The (index-based) selection query for this data structure then becomes
FOR doc IN posts
FILTER 'javascript' IN doc.tags[*].name
RETURN doc`
Contrary to MongoDB, there is no automatic conversion to array values when inserting non-array values in ArangoDB. For example, the following plain strings will not be inserted into an array index, simply because the value of the index attribute is not an array:
db.posts.insert({ tags: "arangodb" });
db.posts.insert({ tags: "javascript" });
db.posts.insert({ tags: "nodejs" });`
Note that in this case a non-array index can still be used.
Use of multiple indexes per collection
The query optimizer can now make use of multiple indexes if multiple filter conditions are combined with logical ORs, and all of them are covered by indexes of the same collection.
Provided there are separate indexes present on name
and status
, the following query can make use of index scans in 2.8, as opposed to full collection scans in 2.7:
FOR doc IN users
FILTER doc.name == 'root' || doc.status == 'active'
RETURN doc`
If multiple filter conditions match for the same document, the result will automatically be deduplicated, so each document is still returned at most once.
Sorted IN comparison
Another improvement for the optimizer is to pre-sort comparison values for IN
and NOT IN
so these operators can use a (much faster) binary search instead of a linear search.
The optimization will be applied automatically for IN
/ NOT IN
comparison values used in filters, which are used inside of a FOR
loop, and depend on runtime values. For example, the optimization will be applied for the following query:
LET values = /* some runtime expression here */
FOR doc IN collection
FILTER doc.value IN values
RETURN doc`
The optimization will not be applied for IN
comparison values that are value literals and those that are used in index lookups. For these cases the comparison values were already deduplicated and sorted.
“sort-in-values” will appear in the list of applied optimizer rules if the optimizer could apply the optimization:
Optimization for LENGTH(collection)
There are multiple ways for counting the total number of documents in a collection from inside an AQL query. One obvious way is to use RETURN LENGTH(collection)
.
That variant however was inefficient as it fully materialized the documents before counting them. In 2.8 calling LENGTH()
for a collection will get automatically replaced by a call to a special function that can efficiently determine the number of documents. For larger collections, this can be several thousand times faster than the naive 2.7 solution.
C++ implementation for many AQL functions
Many existing AQL functions have been backed with a C++ implementation that removes the need for some data conversion that would otherwise happen if the function were implemented in V8/JavaScript only. More than 30+ functions have been changed, including several that may produce bigger result sets (such as EDGES()
, FULLTEXT()
, WITHIN()
, NEAR()
) and that will hugely benefit from this.
Improved skip performance
2.8 improves the performance of skipping over many documents in case no indexes and no filters are used. This might sound like an edge case, but it is quite common when the task is to fetch documents from a big collection in chunks and there is certainty that there will be no parallel modifications.
For example, the following query runs about 3 to 5 times faster in 2.8, and this improvements can easily sum up to notable speedups if the query is called repeatedly with increasing offset values for LIMIT
:
FOR doc IN collection
LIMIT 1000000, 10
RETURN doc`
Get the latest tutorials, blog posts and news: