When and how to use sparse indexes in ArangoDB 2.5

This version is deprecated. Download the new version of ArangoDB

In ArangoDB 2.5, hash and skiplist indexes can optionally be made sparse.

Such sparse indexes will exclude documents in which at least one of the index attributes is either not set or has a value of null. Declaring an index as sparse can provide great savings in memory and index creation CPU time for some cases.

For example, the following document only has attributes _key and name:

{ "_key" : "test", "name" : "foobar" }

In a sparse index on (non-existing) attribute value, the above document will simply be ignored. And in a non-sparse index, it will be inserted with a value of null for attribute value.

Deliberately excluding null values from sparse indexes has a few benefits, which may or may not be reaped depending on how the indexes are used:

  • documents which are excluded from the index will not cost index memory
  • with smaller indexes, a few memory allocations may be saved
  • smaller indexes can improve cache usage
  • less comparisons may be required to find a desired index entry

Let’s have a look at some real-world figures, comparing sparse and non-sparse indexes in ArangoDB.

Hash indexes

Let’s start with hash indexes. The following tables provide the index sizes and the index creation times for non-unique hash indexes. The index was created on a collection with 4 million documents. There was one indexed attribute, and it contained numeric values in a varying amount of documents of the collection. The columns have the following meanings:

  • set: number of documents in which the index attribute was set to a numeric value
  • not set: number of documents in which the index attribute was missing
  • index size: total size of index (in bytes)
  • time: total time required to build the index (in seconds)

ArangoDB 2.4, non-sparse, non-unique hash index

set not set index size time
4,000,000 0 448,000,232 4.807
2,000,000 2,000,000 272,362,672 3.088
1,000,000 3,000,000 184,348,800 2.327
400,000 3,600,000 174,504,576 2.110
0 4,000,000 168,334,976 2.032

As can be seen, the memory usage of non-unique hash indexes already correlated somewhat to the number of documents that actually contained the index attribute. Though there was no linear correlation. The memory usage of the hash index was determined by two factors: the number of distinct index values (for which memory was allocated in large chunks) and the number of duplicates (for which memory was allocated in small chunks).

ArangoDB 2.5, non-sparse, non-unique hash index

When looking at the same in 2.5, not much has changed (though the hash index seems to have got slightly faster):

set not set index size time
4,000,000 0 448,000,232 3.960
2,000,000 2,000,000 272,362,672 2.772
1,000,000 3,000,000 184,348,800 2.213
400,000 3,600,000 174,504,576 2.054
0 4,000,000 168,334,976 1.971

ArangoDB 2.5, sparse, non-unique hash index

Now let’s compare the above with sparse indexes:

set not set index size time
4,000,000 0 448,000,808 3.617
2,000,000 2,000,000 224,000,424 2.442
1,000,000 3,000,000 112,000,232 2.013
400,000 3,600,000 52,000,136 1.634
0 4,000,000 18,000,088 1.468

As can be seen, index memory usage now decreases much more drastically when the majority of the documents do not contain the index attribute. In the above comparison, the sparse index only uses about a tenth of the memory that the non-sparse index had used when created on a non-existing attribute (last row in each table). If only a tenth of the documents contain the index attribute, the sparse index variant can save about 70 % memory.

Overall, the cost of adding a sparse hash index to a collection with optional attributes can be much lower than for adding a non-sparse index.

Skiplist indexes

Following now are the same comparisons as above, but with using skiplist indexes instead of hash indexes.

ArangoDB 2.4, non-sparse, non-unique skiplist index

set not set index size time
4,000,000 0 320,022,640 32.997
2,000,000 2,000,000 320,014,088 34.437
1,000,000 3,000,000 320,018,256 34.730
400,000 3,600,000 319,971,736 35.772
0 4,000,000 320,007,952 35.691

As can be seen, the memory usage of non-unique skiplist indexes did not correlate to the number of documents that contain the index attribute. The memory usage is almost constant (modulo small fluctuations, as the skiplist is a probabilistic data structure).

ArangoDB 2.5, non-sparse, non-unique skiplist index

Same result for non-sparse skiplists in 2.5:

set not set index size time
4,000,000 0 320,024,944 31.903
2,000,000 2,000,000 320,014,456 33.263
1,000,000 3,000,000 320,005,592 34.777
400,000 3,600,000 320,014,720 34.904
0 4,000,000 320,005,408 35.665

ArangoDB 2.5, sparse, non-unique skiplist index

Now we are getting to the sparse variant of the skiplist index.

set not set index size time
4,000,000 0 320,033,720 31.720
2,000,000 2,000,000 159,988,904 14.734
1,000,000 3,000,000 80,016,112 7.351
400,000 3,600,000 32,006,128 3.695
0 4,000,000 520 1.527

The skiplist index is a structure that grows dynamically, and its memory usage grows almost linearly with the number of index entries. For a non-sparse skiplist index, the number of index entries is equal to the number of documents in the collection. But for a sparse index, all documents will be excluded from the index that do not contain the index attribute. Therefore, in the last row, the sparse skiplist index will use only 520 bytes and its overhead can be considered really low.

When to use sparse indexes

The above figures should have demonstrated that sparse indexes are good candidates when it comes to indexing optional attributes, i.e. attributes that are present in few or the minority of documents in a collection. They can help reduce index and overall memory usage, and can reduce index creation time.

However, nothing is for free, so sparse indexes must have downsides, right?

Right! As sparse indexes may exclude some documents, they cannot be used for every type of query. Sparse hash indexes cannot be used to find documents for which at least one of the indexed attributes contains a value of null. For example, the following AQL query cannot use a sparse index, even if one was created on attribute value:

FOR doc In collection 
  FILTER doc.value == null 
  RETURN doc

If a lookup value in a query is non-constant, a sparse index may or may not be used, depending on the other types of conditions in the query. If the optimizer can safely determine that the lookup value cannot be equal to null, a sparse index may be used. When uncertain, the optimizer will not make use of a sparse index in a query in order to produce correct results.

For example, the following queries cannot use a sparse index on value because the optimizer will not know beforehand whether the comparison values for doc.value will include null:

FOR doc In collection 
  FILTER doc.value == SOME_FUNCTION(...) 
  RETURN doc

FOR other IN otherCollection 
  FOR doc In collection 
    FILTER doc.value == other.value
    RETURN doc

Sparse skiplist indexes can be used for sorting if the optimizer can safely detect that the index range does not include null for any of the index attributes.

How to create sparse indexes

In order to create a sparse index, an object with the sparse attribute can be added to the index creation commands:

db.collection.ensureHashIndex(attributeName, { sparse: true }); 
db.collection.ensureHashIndex(attributeName1, attributeName2, { sparse: true }); 
db.collection.ensureUniqueConstraint(attributeName, { sparse: true }); 
db.collection.ensureUniqueConstraint(attributeName1, attributeName2, { sparse: true }); 

db.collection.ensureSkiplist(attributeName, { sparse: true }); 
db.collection.ensureSkiplist(attributeName1, attributeName2, { sparse: true }); 
db.collection.ensureUniqueSkiplist(attributeName, { sparse: true }); 
db.collection.ensureUniqueSkiplist(attributeName1, attributeName2, { sparse: true }); 

Note that in place of the above specialized index creation commands, it is recommended to use the more general index creation command ensureIndex:

db.collection.ensureIndex({ type: "hash", sparse: true, unique: true, fields: [ attributeName ] });
db.collection.ensureIndex({ type: "skiplist", sparse: false, unique: false, fields: [ "a", "b" ] });

When not explicitly set, the sparse attribute defaults to false for new indexes.

Geo and fulltext indexes are implicitly sparse, and there is no way and no need to declare them as sparse.

Jan Steemann

Jan Steemann

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

1 Comment

  1. CoDEmanX on February 26, 2015 at 2:01 pm

    Regarding the null-value limitation of sparse indices: server-side constraints would come in really handy to ensure there’s no attribute value with an explicit null (could be an empty string instead I guess, or not exist at all in most cases) – like a NEVER_NULL flag for collection attributes.

Leave a Comment

Get the latest tutorials, blog posts and news: