When and how to use sparse indexes in ArangoDB 2.5
This version is deprecated. Download the new version of ArangoDB
In ArangoDB 2.5, hash and skiplist indexes can optionally be made sparse.
Such sparse indexes will exclude documents in which at least one of the index attributes is either not set or has a value of null
. Declaring an index as sparse can provide great savings in memory and index creation CPU time for some cases.
For example, the following document only has attributes _key
and name
:
{ "_key" : "test", "name" : "foobar" }
In a sparse index on (non-existing) attribute value
, the above document will simply be ignored. And in a non-sparse index, it will be inserted with a value of null
for attribute value
.
Deliberately excluding null
values from sparse indexes has a few benefits, which may or may not be reaped depending on how the indexes are used:
- documents which are excluded from the index will not cost index memory
- with smaller indexes, a few memory allocations may be saved
- smaller indexes can improve cache usage
- less comparisons may be required to find a desired index entry
Let’s have a look at some real-world figures, comparing sparse and non-sparse indexes in ArangoDB.
Hash indexes
Let’s start with hash indexes. The following tables provide the index sizes and the index creation times for non-unique hash indexes. The index was created on a collection with 4 million documents. There was one indexed attribute, and it contained numeric values in a varying amount of documents of the collection. The columns have the following meanings:
set
: number of documents in which the index attribute was set to a numeric valuenot set
: number of documents in which the index attribute was missingindex size
: total size of index (in bytes)time
: total time required to build the index (in seconds)
ArangoDB 2.4, non-sparse, non-unique hash index
set | not set | index size | time |
---|---|---|---|
4,000,000 | 0 | 448,000,232 | 4.807 |
2,000,000 | 2,000,000 | 272,362,672 | 3.088 |
1,000,000 | 3,000,000 | 184,348,800 | 2.327 |
400,000 | 3,600,000 | 174,504,576 | 2.110 |
0 | 4,000,000 | 168,334,976 | 2.032 |
As can be seen, the memory usage of non-unique hash indexes already correlated somewhat to the number of documents that actually contained the index attribute. Though there was no linear correlation. The memory usage of the hash index was determined by two factors: the number of distinct index values (for which memory was allocated in large chunks) and the number of duplicates (for which memory was allocated in small chunks).
ArangoDB 2.5, non-sparse, non-unique hash index
When looking at the same in 2.5, not much has changed (though the hash index seems to have got slightly faster):
set | not set | index size | time |
---|---|---|---|
4,000,000 | 0 | 448,000,232 | 3.960 |
2,000,000 | 2,000,000 | 272,362,672 | 2.772 |
1,000,000 | 3,000,000 | 184,348,800 | 2.213 |
400,000 | 3,600,000 | 174,504,576 | 2.054 |
0 | 4,000,000 | 168,334,976 | 1.971 |
ArangoDB 2.5, sparse, non-unique hash index
Now let’s compare the above with sparse indexes:
set | not set | index size | time |
---|---|---|---|
4,000,000 | 0 | 448,000,808 | 3.617 |
2,000,000 | 2,000,000 | 224,000,424 | 2.442 |
1,000,000 | 3,000,000 | 112,000,232 | 2.013 |
400,000 | 3,600,000 | 52,000,136 | 1.634 |
0 | 4,000,000 | 18,000,088 | 1.468 |
As can be seen, index memory usage now decreases much more drastically when the majority of the documents do not contain the index attribute. In the above comparison, the sparse index only uses about a tenth of the memory that the non-sparse index had used when created on a non-existing attribute (last row in each table). If only a tenth of the documents contain the index attribute, the sparse index variant can save about 70 % memory.
Overall, the cost of adding a sparse hash index to a collection with optional attributes can be much lower than for adding a non-sparse index.
Skiplist indexes
Following now are the same comparisons as above, but with using skiplist indexes instead of hash indexes.
ArangoDB 2.4, non-sparse, non-unique skiplist index
set | not set | index size | time |
---|---|---|---|
4,000,000 | 0 | 320,022,640 | 32.997 |
2,000,000 | 2,000,000 | 320,014,088 | 34.437 |
1,000,000 | 3,000,000 | 320,018,256 | 34.730 |
400,000 | 3,600,000 | 319,971,736 | 35.772 |
0 | 4,000,000 | 320,007,952 | 35.691 |
As can be seen, the memory usage of non-unique skiplist indexes did not correlate to the number of documents that contain the index attribute. The memory usage is almost constant (modulo small fluctuations, as the skiplist is a probabilistic data structure).
ArangoDB 2.5, non-sparse, non-unique skiplist index
Same result for non-sparse skiplists in 2.5:
set | not set | index size | time |
---|---|---|---|
4,000,000 | 0 | 320,024,944 | 31.903 |
2,000,000 | 2,000,000 | 320,014,456 | 33.263 |
1,000,000 | 3,000,000 | 320,005,592 | 34.777 |
400,000 | 3,600,000 | 320,014,720 | 34.904 |
0 | 4,000,000 | 320,005,408 | 35.665 |
ArangoDB 2.5, sparse, non-unique skiplist index
Now we are getting to the sparse variant of the skiplist index.
set | not set | index size | time |
---|---|---|---|
4,000,000 | 0 | 320,033,720 | 31.720 |
2,000,000 | 2,000,000 | 159,988,904 | 14.734 |
1,000,000 | 3,000,000 | 80,016,112 | 7.351 |
400,000 | 3,600,000 | 32,006,128 | 3.695 |
0 | 4,000,000 | 520 | 1.527 |
The skiplist index is a structure that grows dynamically, and its memory usage grows almost linearly with the number of index entries. For a non-sparse skiplist index, the number of index entries is equal to the number of documents in the collection. But for a sparse index, all documents will be excluded from the index that do not contain the index attribute. Therefore, in the last row, the sparse skiplist index will use only 520 bytes and its overhead can be considered really low.
When to use sparse indexes
The above figures should have demonstrated that sparse indexes are good candidates when it comes to indexing optional attributes, i.e. attributes that are present in few or the minority of documents in a collection. They can help reduce index and overall memory usage, and can reduce index creation time.
However, nothing is for free, so sparse indexes must have downsides, right?
Right! As sparse indexes may exclude some documents, they cannot be used for every type of query. Sparse hash indexes cannot be used to find documents for which at least one of the indexed attributes contains a value of null
. For example, the following AQL query cannot use a sparse index, even if one was created on attribute value
:
FOR doc In collection
FILTER doc.value == null
RETURN doc
If a lookup value in a query is non-constant, a sparse index may or may not be used, depending on the other types of conditions in the query. If the optimizer can safely determine that the lookup value cannot be equal to null
, a sparse index may be used. When uncertain, the optimizer will not make use of a sparse index in a query in order to produce correct results.
For example, the following queries cannot use a sparse index on value
because the optimizer will not know beforehand whether the comparison values for doc.value
will include null
:
FOR doc In collection
FILTER doc.value == SOME_FUNCTION(...)
RETURN doc
FOR other IN otherCollection
FOR doc In collection
FILTER doc.value == other.value
RETURN doc
Sparse skiplist indexes can be used for sorting if the optimizer can safely detect that the index range does not include null
for any of the index attributes.
How to create sparse indexes
In order to create a sparse index, an object with the sparse
attribute can be added to the index creation commands:
db.collection.ensureHashIndex(attributeName, { sparse: true });
db.collection.ensureHashIndex(attributeName1, attributeName2, { sparse: true });
db.collection.ensureUniqueConstraint(attributeName, { sparse: true });
db.collection.ensureUniqueConstraint(attributeName1, attributeName2, { sparse: true });
db.collection.ensureSkiplist(attributeName, { sparse: true });
db.collection.ensureSkiplist(attributeName1, attributeName2, { sparse: true });
db.collection.ensureUniqueSkiplist(attributeName, { sparse: true });
db.collection.ensureUniqueSkiplist(attributeName1, attributeName2, { sparse: true });
Note that in place of the above specialized index creation commands, it is recommended to use the more general index creation command ensureIndex
:
db.collection.ensureIndex({ type: "hash", sparse: true, unique: true, fields: [ attributeName ] });
db.collection.ensureIndex({ type: "skiplist", sparse: false, unique: false, fields: [ "a", "b" ] });
When not explicitly set, the sparse
attribute defaults to false
for new indexes.
Geo and fulltext indexes are implicitly sparse, and there is no way and no need to declare them as sparse.
1 Comments
Leave a Comment
Get the latest tutorials, blog posts and news:
Regarding the null-value limitation of sparse indices: server-side constraints would come in really handy to ensure there’s no attribute value with an explicit null (could be an empty string instead I guess, or not exist at all in most cases) – like a NEVER_NULL flag for collection attributes.