Index types and how indexes are used in ArangoDB: Part II
In the first part of this article we dived deep into the indexes currently available in ArangoDB (3.2 and 3.3), and also took a brief look at the improvements coming with ArangoDB 3.4. Read Part I here.
In this Part II, we are going to focus on how to actually add indexes to a data model and speed up specific queries.
Adding indexes to the data model
The goal of adding an extra index to the data model is to speed up a certain query or even multiple queries.
One of the first things to do during development of AQL queries is to review the output of the `explain` command. A query can be explained using ArangoDB’s web UI or from the ArangoShell. In the ArangoShell it is as simple as `db._explain(query)`, where `query` is the AQL query string. To explain a query that also uses bind parameters, they need to be passed separately into the command, e.g. `db._explain(query, bindParameters)`.
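For example, explaining a query that uses a bind parameter `@val` could look like this in the ArangoShell (the query and the value are made up for illustration):

arangosh> db._explain(`FOR doc IN test FILTER doc.value == @val RETURN doc`, { val: 4711 })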
Use case: adding a hash index to avoid a full collection scan
When there are no indexes present and the query accesses a collection with a lot of documents, it will very likely end up doing a full collection scan, meaning that every document in the collection is inspected (potentially even read from disk). This will take time proportional to the number of documents in the collection.
Consider the following query on a collection `test` that was just created (i.e. no indexes present apart from the primary index):
arangosh> db._explain(`FOR doc IN test FILTER doc.value == 4711 RETURN doc`)
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 1000000 - FOR doc IN test /* full collection scan */
3 CalculationNode 1000000 - LET #1 = (doc.`value` == 4711) /* simple expression */
4 FilterNode 1000000 - FILTER #1
5 ReturnNode 1000000 - RETURN doc
Indexes used:
none
This shows that no indexes will be used when executing this query, and that the optimizer has picked a full collection scan to look at all the documents in the collection (indicated by the `EnumerateCollectionNode` in the execution plan above, which is the node type that scans all documents of a collection). As the collection contains 1,000,000 documents, this query is almost guaranteed to be slow without a proper index.
To make this query perform better, an index on `value` is clearly needed. Candidate index types for the new index are `hash` and `skiplist`:
- For MMFiles, the difference between these two types is huge, as a hash index will have amortized complexity of O(1) for all operations, whereas a skiplist index has amortized complexity of O(log n). That means the skiplist index will be much more expensive than a hash index in MMFiles.
- For the RocksDB engine, both index types can be used interchangeably.
As a general rule, we should use a hash index if we are certain that all queries will look up `value` by equality (as above). If that is not known in advance, or if there will be queries that look up `value` using range lookups or that sort by `value`, we may be better off with a skiplist index, at least for MMFiles.
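For example, a range lookup like the following cannot be served by an MMFiles hash index, but a skiplist index on `value` can handle it (the bounds are made up for illustration):

FOR doc IN test
  FILTER doc.value >= 4000 AND doc.value < 5000
  RETURN doc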
The next decision for the new index is whether it can be made unique (which may slightly improve lookup performance) or not. In this case we know that `value` is not going to be unique in that collection, so we need to pick a non-unique index. We also know that all documents in that collection contain a non-null `value` attribute, so we will not benefit from a sparse index here at all.
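Such a non-unique, non-sparse hash index on `value` could be created from the ArangoShell like this (a sketch; the collection name `test` follows the example above):

arangosh> db.test.ensureIndex({ type: "hash", fields: ["value"], unique: false, sparse: false })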
By creating this non-unique, non-sparse hash index on `value`, the same query’s execution plan greatly improves to just:
Query string:
FOR doc IN test FILTER doc.value == 4711 RETURN doc
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 200 - FOR doc IN test /* hash index scan */
5 ReturnNode 200 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash test false false 0.50 % [ `value` ] (doc.`value` == 4711)
with an estimated 200 documents to look at instead of 1,000,000. That is an improvement by several orders of magnitude.
Use case: adding a combined index
Assume we have a query like the following, which will try to find documents for a given category and with timestamps that fall into a specific time period:
FOR doc IN test
FILTER doc.category == 'test2'
FILTER doc.timestamp > DATE_NOW() - 100000 AND doc.timestamp < DATE_NOW()
RETURN doc
Now it is possible to create a hash index on the `category` attribute and a skiplist index on `timestamp`.
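The two individual indexes could be created like this (a sketch, again assuming the `test` collection from above):

arangosh> db.test.ensureIndex({ type: "hash", fields: ["category"] })
arangosh> db.test.ensureIndex({ type: "skiplist", fields: ["timestamp"] })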
This will bring up an execution plan that only uses one of the indexes we just created:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 IndexNode 10000 - FOR doc IN test /* hash index scan */
5 CalculationNode 10000 - LET #3 = ((doc.`timestamp` > (DATE_NOW() - 100000)) && (doc.`timestamp` < DATE_NOW()))
6 FilterNode 10000 - FILTER #3
7 ReturnNode 10000 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
8 hash test false false 0.01 % [ `category` ] (doc.`category` == "test2")
Why doesn’t the query use both indexes?
In general, the query optimizer will pick one index per FOR loop in an AQL query and use it. Even though there are two indexes available, it is not guaranteed that using both indexes would produce a better result in terms of execution speed. So the optimizer will pick the one index for which it expects the fewest documents will need to be inspected and returned.
Still, this can be improved. Instead of creating individual indexes on `category` and `timestamp`, it would be much better here to create a combined skiplist index on `category` and `timestamp` (in this order).
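Such a combined index could be created like this (a sketch; with it in place, the two individual indexes from before are no longer needed for this query):

arangosh> db.test.ensureIndex({ type: "skiplist", fields: ["category", "timestamp"] })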
With that index, the execution plan gets simplified to:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
8 IndexNode 666 - FOR doc IN test /* skiplist index scan */
7 ReturnNode 666 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
8 skiplist test false false 99.94 % [ `category`, `timestamp` ] ((doc.`category` == "test2") && (doc.`timestamp` > (DATE_NOW() - 100000)) && (doc.`timestamp` < DATE_NOW()))
which will likely correspond to far fewer documents that need to be inspected, and thus to much better query execution performance. As a bonus, that combined index can also be used for filtering on `category` alone, for sorting on `category`, or for sorting on `category` and `timestamp`.
For example, if our query is
FOR doc IN test
FILTER doc.category == 'test2'
FILTER doc.timestamp > DATE_NOW() - 100000 AND doc.timestamp < DATE_NOW()
SORT doc.timestamp DESC
RETURN doc
its execution plan will not be any more complex, thanks to the combined index that can also handle the sort:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
10 IndexNode 666 - FOR doc IN test /* reverse skiplist index scan */
9 ReturnNode 666 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
10 skiplist test false false 99.94 % [ `category`, `timestamp` ] ((doc.`category` == "test2") && (doc.`timestamp` > (DATE_NOW() - 100000)) && (doc.`timestamp` < DATE_NOW()))
Note: sorting on `timestamp` here is only possible because we are effectively filtering on a constant `category` value, so only a single `category` value will be returned by this query. The query optimizer is smart enough to detect this and will use the index for sorting even though `category` was not specified in the SORT clause.
One vs. multiple indexes
A query can use more than one index per collection, but for each FOR loop in an AQL query the optimizer will use at most one index if the filter conditions are combined with logical AND. The optimizer will try to pick the index that promises to reduce work (filtering, sorting) by the greatest amount. If there are multiple candidate indexes, they will all be looked at, but the one with the least expected cost will be chosen. Therefore it often makes sense to use combined indexes.
In case there are filter conditions on different attributes combined with logical OR, the optimizer may pick more than one index even for the same FOR loop.
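For example, with separate indexes present on `value` and `category`, the optimizer may answer a query like the following with one index scan per OR branch instead of a full collection scan (a sketch reusing the attributes from the earlier examples):

FOR doc IN test
  FILTER doc.value == 4711 OR doc.category == 'test2'
  RETURN doc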
Things to avoid
Creating too many indexes
Creating extra indexes comes at a cost, in terms of both storage and runtime overhead.
The MMFiles engine has in-memory indexes, so every index that is created will hog a fraction of the scarce RAM. For the RocksDB engine the indexes will be stored on disk, so they will take up disk space. Additionally, even though the RocksDB engine’s indexes will not be built up in RAM, they may still use some RAM because index values may land in some in-memory caches too.
Regarding runtime overhead, here is a list of the situations in which indexes have an effect on performance:
- The index is initially created, and all documents from the collections are inserted into the index. This will take time proportional to the number of documents in the collection, and the runtime will also depend on the algorithmic complexity of the index internals.
- On every document insert, update, replace or remove operation the index needs to be changed as well. This will extend the time required for all document operations that affect the index.
- The collection is loaded again after a server restart, and the in-memory index is built up from the on-disk document data (note: this is relevant for the MMFiles engine only; it will not happen for the RocksDB engine, as that engine keeps all its indexes on disk and does not build them up in memory on collection load).
Therefore one only wants to create the indexes that are actually required. It makes no sense to index every existing attribute, because that is guaranteed to slow down writes, but at the same time has only a questionable effect on the performance of lookup queries.
Often it makes most sense to index the attributes that are used in the majority of queries, so that the majority of queries also benefits. It is even more useful to find combinations of attributes that are used together in several queries, and then create combined indexes on them.
When in doubt, one should measure the effect of extra indexes on queries and data-modification operations by using `explain`, and by tracking query execution times in the client applications.
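A quick way to check which indexes currently exist on a collection, and whether a given query actually uses them, is the following (a sketch using the `test` collection from the earlier examples):

arangosh> db.test.getIndexes()
arangosh> db._explain(`FOR doc IN test FILTER doc.value == 4711 RETURN doc`)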
However, not using indexes where they could greatly improve the situation should also be avoided for obvious reasons.
Using function calls on index attributes
In contrast to some other databases, ArangoDB does not provide any indexes that can store function results or other evaluated expression values. Using a function on an indexed document attribute in a FILTER condition will very likely render the index useless for that query.
For example, assume there is an index present on attribute `value` in collection `test`. The following query will try to compare all values from the index, in their lower-cased variant, against a constant search string. However, it will not use the index, because the index cannot provide the result of the `LOWER` AQL function. Instead, the query will do a full collection scan:
FOR doc IN test FILTER LOWER(doc.value) == 'bart' RETURN doc
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 EnumerateCollectionNode 100000 - FOR doc IN test /* full collection scan */
3 CalculationNode 100000 - LET #1 = (LOWER(doc.`value`) == "bart")
4 FilterNode 100000 - FILTER #1
5 ReturnNode 100000 - RETURN doc
Indexes used:
none
To make this query use an index on attribute `value`, there are two options, both of which require slight changes to the data model:
- if possible, only store `value` values in a completely lower-cased form. Then it is only required to convert all lookup values to lowercase.
- add an extra attribute `lowercasedValue` to the collection to store the lower-cased version of `value`, solely for the purpose of indexing it. The index then needs to be created on `lowercasedValue` and not on `value`.
An example of the latter follows:
FOR doc IN test FILTER doc.lowercasedValue == LOWER('bart') RETURN doc
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 1 - FOR doc IN test /* hash index scan */
5 ReturnNode 1 - RETURN doc
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash test false false 99.99 % [ `lowercasedValue` ] (doc.`lowercasedValue` == "bart")
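The extra `lowercasedValue` attribute of course needs to be populated and kept in sync by the application. A one-off backfill for existing documents could look like the following AQL (a sketch following the attribute names used above):

FOR doc IN test
  UPDATE doc WITH { lowercasedValue: LOWER(doc.value) } IN test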
Creating sparse indexes
In some cases declaring an index sparse will prevent the query optimizer from using it.
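Assume sparse hash indexes on `value` have been created in the two collections `test1` and `test2` used below, e.g. like this (a sketch):

arangosh> db.test1.ensureIndex({ type: "hash", fields: ["value"], sparse: true })
arangosh> db.test2.ensureIndex({ type: "hash", fields: ["value"], sparse: true })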
For example, this two-collection join will turn into a nightmare with sparse indexes.
FOR doc1 IN test1 FOR doc2 IN test2 FILTER doc1.value == doc2.value RETURN doc1
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
3 EnumerateCollectionNode 0 - FOR doc2 IN test2 /* full collection scan */
2 EnumerateCollectionNode 0 - FOR doc1 IN test1 /* full collection scan */
4 CalculationNode 0 - LET #2 = (doc1.`value` == doc2.`value`)
5 FilterNode 0 - FILTER #2
6 ReturnNode 0 - RETURN doc1
The reason for not using the indexes here is that in each collection there may be documents that do not have a `value` attribute, or that contain a `value` attribute with a value of `null`. If the optimizer used any of the sparse indexes on `value`, it would not find these documents (as they would not have been indexed). So without using the indexes, the query would produce more results than with the indexes. This is a situation the optimizer must avoid, so it will discard the sparse indexes and carry on without them. Note that it would still use non-sparse indexes here if present, but the point is that declaring an index sparse can have undesired effects for some queries.
So an index should only be declared sparse after all queries using that attribute have been checked, and it has been verified that the index can still be used for them even when sparse.
What’s next
We are currently working on ArangoSearch (coming with ArangoDB 3.4), which will be a superior replacement for the existing fulltext index. It offers relevance ranking, allows indexing of multiple attributes and supports multiple languages (including Chinese and Russian). You can already try it out by downloading a 3.4 milestone release and going through the ArangoSearch tutorial.
We are also working on significantly extending and improving geo index functionality coming with ArangoDB 3.4. It will offer more advanced querying functionality, performance optimizations and support for standard GeoJSON types (polygons, multi-polygons, polylines, intersections, etc.).
Join our Community Slack to give us feedback or improvement suggestions, to ask questions, or just to chat with us.