Index types and how indexes are used in ArangoDB: Part II
In the first part of this article we dived deep into the indexes currently available in ArangoDB (3.2 and 3.3), and also took a brief look at the improvements coming with ArangoDB 3.4. Read Part I here.
In this Part II, we are going to focus on how to actually add indexes to a data model and speed up specific queries.
Adding indexes to the data model
The goal of adding an extra index to the data model is to speed up a certain query or even multiple queries.
One of the first things to do during development of AQL queries is to review the output of the `explain` command. A query can be explained using ArangoDB’s web UI or from the ArangoShell. In the ArangoShell it is as simple as `db._explain(query)`, where `query` is the AQL query string. To explain a query that also uses bind parameters, they need to be passed separately into the command, e.g. `db._explain(query, bindParameters)`.
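For example, explaining a query that uses a bind parameter `@val` could look like this in the ArangoShell (the query and the value are made up for illustration):

arangosh> db._explain(`FOR doc IN test FILTER doc.value == @val RETURN doc`, { val: 4711 })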
Use case: adding a hash index to avoid a full collection scan
When there are no indexes present and the query accesses a collection with a lot of documents, it will very likely end up doing a full collection scan, meaning that every document in the collection is inspected (potentially even read from disk). This will take time proportional to the number of documents in the collection.
Consider the following query on a collection `test` that was just created (i.e. no indexes present apart from the primary index):
arangosh> db._explain(`FOR doc IN test FILTER doc.value == 4711 RETURN doc`)
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 1000000 - FOR doc IN test /* full collection scan */
3 CalculationNode 1000000 - LET #1 = (doc.`value` == 4711) /* simple expression */
4 FilterNode 1000000 - FILTER #1
5 ReturnNode 1000000 - RETURN doc
Indexes used:
none
This shows that no indexes will be used when executing this query, and that the optimizer has picked a full collection scan to look at all the documents in the collection (indicated by the `EnumerateCollectionNode` in the execution plan above, which is the node type that scans all documents of a collection). As the collection contains 1,000,000 documents, this query is almost guaranteed to be slow without a proper index.
To make this query perform better, an index on `value` is clearly needed. Candidate index types for the new index are `hash` and `skiplist`:
- For MMFiles, the difference between these two types is huge, as a hash index will have amortized complexity of O(1) for all operations, whereas a skiplist index has amortized complexity of O(log n). That means the skiplist index will be much more expensive than a hash index in MMFiles.
- For the RocksDB engine, both index types can be used interchangeably.
As a general rule, we should use a hash index if we are certain that all queries will look up `value` by equality (as above). If that is not known in advance, or if there will be queries that look up `value` using range lookups or that sort by `value`, we may be better off with a skiplist index, at least for MMFiles.
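For example, a range lookup like the following cannot be served by an MMFiles hash index, but a skiplist index on `value` can handle it (the bounds are made up for illustration):

FOR doc IN test
  FILTER doc.value >= 4000 AND doc.value < 5000
  RETURN doc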
The next decision for the new index is whether it can be made unique (which may slightly improve lookup performance) or not. In this case we know that `value` is not going to be unique in that collection, so we need to pick a non-unique index. We also know that all documents in that collection contain a non-null `value` attribute, so we will not benefit from a sparse index here at all.
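Such a non-unique, non-sparse hash index on `value` could be created from the ArangoShell like this (a sketch; the collection name `test` follows the example above):

arangosh> db.test.ensureIndex({ type: "hash", fields: ["value"], unique: false, sparse: false })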
By creating this non-unique, non-sparse hash index on `value`, the same query’s execution plan greatly improves to just:
Query string:
FOR doc IN test FILTER doc.value == 4711 RETURN doc
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 200 - FOR doc IN test /* hash index scan */
5 ReturnNode 200 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash test false false 0.50 % [ `value` ] (doc.`value` == 4711)
with an estimated 200 documents to look at instead of 1,000,000. That is an improvement by several orders of magnitude.
Use case: adding a combined index
Assume we have a query like the following, which will try to find documents for a given category and with timestamps that fall into a specific time period:
FOR doc IN test
FILTER doc.category == 'test2'
FILTER doc.timestamp > DATE_NOW() - 100000 AND doc.timestamp < DATE_NOW()
RETURN doc
Now it is possible to create a hash index on the `category` attribute and a skiplist index on `timestamp`.
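The two individual indexes could be created like this (a sketch, again assuming the `test` collection from above):

arangosh> db.test.ensureIndex({ type: "hash", fields: ["category"] })
arangosh> db.test.ensureIndex({ type: "skiplist", fields: ["timestamp"] })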
This will bring up an execution plan that only uses one of the indexes we just created:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 IndexNode 10000 - FOR doc IN test /* hash index scan */
5 CalculationNode 10000 - LET #3 = ((doc.`timestamp` > (DATE_NOW() - 100000)) && (doc.`timestamp` < DATE_NOW()))
6 FilterNode 10000 - FILTER #3
7 ReturnNode 10000 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
8 hash test false false 0.01 % [ `category` ] (doc.`category` == "test2")
Why doesn’t the query use both indexes?
In general, the query optimizer will pick one index per FOR loop in an AQL query and use it. Even though there are two indexes available, it is not guaranteed that using both indexes would produce a better result in terms of execution speed. So the optimizer will pick the one index for which it expects the fewest documents will need to be inspected and returned.
Still, this can be improved. Instead of creating individual indexes on `category` and `timestamp`, it would be much better here to create a combined skiplist index on `category` and `timestamp` (in this order).
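Such a combined index could be created like this (a sketch; with it in place, the two individual indexes from before are no longer needed for this query):

arangosh> db.test.ensureIndex({ type: "skiplist", fields: ["category", "timestamp"] })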
With that index, the execution plan gets simplified to:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 IndexNode 666 - FOR doc IN test /* skiplist index scan */
7 ReturnNode 666 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
8 skiplist test false false 99.94 % [ `category`, `timestamp` ] ((doc.`category` == "test2") && (doc.`timestamp` > (DATE_NOW() - 100000)) && (doc.`timestamp` < DATE_NOW()))
which will likely correspond to far fewer documents that need to be inspected, and thus to much better query execution performance. As a bonus, that combined index can also be used for filtering on `category` alone, for sorting on `category`, or for sorting on `category` and `timestamp`.
For example, if our query is
FOR doc IN test
FILTER doc.category == 'test2'
FILTER doc.timestamp > DATE_NOW() - 100000 AND doc.timestamp < DATE_NOW()
SORT doc.timestamp DESC
RETURN doc
its execution plan will not be any more complex, thanks to the combined index that can also handle the sort:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
10 IndexNode 666 - FOR doc IN test /* reverse skiplist index scan */
9 ReturnNode 666 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
10 skiplist test false false 99.94 % [ `category`, `timestamp` ] ((doc.`category` == "test2") && (doc.`timestamp` > (DATE_NOW() - 100000)) && (doc.`timestamp` < DATE_NOW()))
Note: sorting on `timestamp` here is only possible because we are effectively filtering on a constant `category` value, so only a single `category` value will be returned by this query. The query optimizer is smart enough to detect this and will use the index for sorting even though `category` was not specified in the SORT clause.
One vs. multiple indexes
A query can use more than one index per collection, but for each FOR loop in an AQL query the optimizer will use at most one index if the filter conditions are combined with logical AND. The optimizer will try to pick the index that promises to reduce work (filtering, sorting) by the greatest amount. If there are multiple candidate indexes, they will all be looked at, but the one with the least expected cost will be chosen. Therefore it often makes sense to use combined indexes.
In case there are filter conditions on different attributes combined with logical OR, the optimizer may pick more than one index even for the same FOR loop.
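For example, with separate indexes present on `value` and `category`, the optimizer may answer a query like the following with one index scan per OR branch instead of a full collection scan (a sketch reusing the attributes from the earlier examples):

FOR doc IN test
  FILTER doc.value == 4711 OR doc.category == 'test2'
  RETURN doc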
Things to avoid
Creating too many indexes
Creating extra indexes comes at a cost, in terms of both storage and runtime overhead.
The MMFiles engine has in-memory indexes, so every index that is created will hog a fraction of the scarce RAM. For the RocksDB engine the indexes will be stored on disk, so they will take up disk space. Additionally, even though the RocksDB engine’s indexes will not be built up in RAM, they may still use some RAM because index values may land in some in-memory caches too.
Regarding runtime overhead, here is a list of the situations in which indexes have an effect on performance:
- The index is initially created, and all documents from the collections are inserted into the index. This will take time proportional to the number of documents in the collection, and the runtime will also depend on the algorithmic complexity of the index internals.
- On every document insert, update, replace or remove operation the index needs to be changed as well. This will extend the time required for all document operations that affect the index.
- The collection is loaded again after a server restart, and the in-memory index is built up from the on-disk document data (note: this is relevant for the MMFiles engine only; it will not happen for the RocksDB engine, as that engine keeps all its indexes on disk and does not build them up in memory on collection load).
Therefore one only wants to create the indexes that are actually required. It makes no sense to index every existing attribute, because that is guaranteed to slow down writes, but at the same time has only a questionable effect on the performance of lookup queries.
Often it makes most sense to index the attributes that are used in the majority of queries, so that the majority of queries also benefits. It is even more useful to find combinations of attributes that are used together in several queries, and then create combined indexes on them.
When in doubt, one should measure the effect of extra indexes on queries and data-modification operations by using `explain`, and by tracking query execution times in the client applications.
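A quick way to check which indexes currently exist on a collection, and whether a given query actually uses them, is the following (a sketch using the `test` collection from the earlier examples):

arangosh> db.test.getIndexes()
arangosh> db._explain(`FOR doc IN test FILTER doc.value == 4711 RETURN doc`)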
However, not using indexes where they could greatly improve the situation should also be avoided for obvious reasons.
Using function calls on index attributes
In contrast to some other databases, ArangoDB does not provide any indexes that can store function results or other evaluated expression values. Using a function on an indexed document attribute in a FILTER condition will very likely render the index useless for that query.
For example, assume there is an index present on attribute `value` in collection `test`. The following query will try to compare all values from the index, in their lower-cased variant, against a constant search string. However, it will not use the index, because the index cannot provide the result of the `LOWER` AQL function. Instead, the query will do a full collection scan:
FOR doc IN test FILTER LOWER(doc.value) == 'bart' RETURN doc
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 100000 - FOR doc IN test /* full collection scan */
3 CalculationNode 100000 - LET #1 = (LOWER(doc.`value`) == "bart")
4 FilterNode 100000 - FILTER #1
5 ReturnNode 100000 - RETURN doc
Indexes used:
none
To make this query use an index on attribute `value`, there are two options, both of which require slight changes to the data model:
- if possible, only store `value` values in a completely lower-cased form. Then it is only required to convert all lookup values to lowercase.
- add an extra attribute `lowercasedValue` to the collection to store the lower-cased version of `value`, solely for the purpose of indexing it. The index then needs to be created on `lowercasedValue` and not on `value`.
An example of the latter follows:
FOR doc IN test FILTER doc.lowercasedValue == LOWER('bart') RETURN doc
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 1 - FOR doc IN test /* hash index scan */
5 ReturnNode 1 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash test false false 99.99 % [ `lowercasedValue` ] (doc.`lowercasedValue` == "bart")
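The extra `lowercasedValue` attribute of course needs to be populated and kept in sync by the application. A one-off backfill for existing documents could look like the following AQL (a sketch following the attribute names used above):

FOR doc IN test
  UPDATE doc WITH { lowercasedValue: LOWER(doc.value) } IN test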
Creating sparse indexes
In some cases declaring an index sparse will prevent the query optimizer from using it.
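Assume sparse hash indexes on `value` have been created in the two collections `test1` and `test2` used below, e.g. like this (a sketch):

arangosh> db.test1.ensureIndex({ type: "hash", fields: ["value"], sparse: true })
arangosh> db.test2.ensureIndex({ type: "hash", fields: ["value"], sparse: true })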
For example, this two-collection join will turn into a nightmare with sparse indexes.
FOR doc1 IN test1 FOR doc2 IN test2 FILTER doc1.value == doc2.value RETURN doc1
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
3 EnumerateCollectionNode 0 - FOR doc2 IN test2 /* full collection scan */
2 EnumerateCollectionNode 0 - FOR doc1 IN test1 /* full collection scan */
4 CalculationNode 0 - LET #2 = (doc1.`value` == doc2.`value`)
5 FilterNode 0 - FILTER #2
6 ReturnNode 0 - RETURN doc1
The reason for not using the indexes here is that in each collection there may be documents that do not have a `value` attribute, or that contain a `value` attribute with a value of `null`. If the optimizer used any of the sparse indexes on `value`, it would not find these documents (as they would not have been indexed). So without using the indexes, the query would produce more results than with the indexes. This is a situation the optimizer must avoid, so it will discard the sparse indexes and carry on without them. Note that it would still use non-sparse indexes here if present, but the point is that declaring an index sparse can have undesired effects for some queries.
So an index should only be declared sparse after all queries using that attribute have been checked, and it has been verified that the index can still be used for them even when sparse.
What’s next
We are currently working on ArangoSearch (coming with ArangoDB 3.4), which will be a superior replacement for the existing fulltext index. It offers relevance ranking, allows indexing of multiple attributes and supports multiple languages (including Chinese and Russian). You can already try it out by downloading a 3.4 milestone release and going through the ArangoSearch tutorial.
We are also working on significantly extending and improving geo index functionality coming with ArangoDB 3.4. It will offer more advanced querying functionality, performance optimizations and support for standard GeoJSON types (polygons, multi-polygons, polylines, intersections, etc.).
Join our Community Slack to give us feedback or improvement suggestions, to ask questions, or just to chat with us.