When and how to use sparse indexes in ArangoDB 2.5

February 24 2015,/Performance, Releases

In ArangoDB 2.5, hash and skiplist indexes can optionally be made sparse.

Such sparse indexes will exclude documents in which at least one of the index attributes is either not set or has a value of null. Declaring an index as sparse can provide great savings in memory and index creation CPU time for some cases.

For example, the following document only has attributes _key and name:

{ "_key" : "test", "name" : "foobar" }

In a sparse index on (non-existing) attribute value, the above document will simply be ignored. And in a non-sparse index, it will be inserted with a value of null for attribute value.

Deliberately excluding null values from sparse indexes has a few benefits, which may or may not be reaped depending on how the indexes are used:

documents which are excluded from the index will not cost index memory
with smaller indexes, a few memory allocations may be saved
smaller indexes can improve cache usage
less comparisons may be required to find a desired index entry

Let’s have a look at some real-world figures, comparing sparse and non-sparse indexes in ArangoDB.

Hash indexes

Let’s start with hash indexes. The following tables provide the index sizes and the index creation times for non-unique hash indexes. The index was created on a collection with 4 million documents. There was one indexed attribute, and it contained numeric values in a varying amount of documents of the collection. The columns have the following meanings:

set: number of documents in which the index attribute was set to a numeric value
not set: number of documents in which the index attribute was missing
index size: total size of index (in bytes)
time: total time required to build the index (in seconds)

ArangoDB 2.4, non-sparse, non-unique hash index

set	not set	index size	time
4,000,000	0	448,000,232	4.807
2,000,000	2,000,000	272,362,672	3.088
1,000,000	3,000,000	184,348,800	2.327
400,000	3,600,000	174,504,576	2.110
0	4,000,000	168,334,976	2.032

As can be seen, the memory usage of non-unique hash indexes already correlated somewhat to the number of documents that actually contained the index attribute. Though there was no linear correlation. The memory usage of the hash index was determined by two factors: the number of distinct index values (for which memory was allocated in large chunks) and the number of duplicates (for which memory was allocated in small chunks).

ArangoDB 2.5, non-sparse, non-unique hash index

When looking at the same in 2.5, not much has changed (though the hash index seems to have got slightly faster):

set	not set	index size	time
4,000,000	0	448,000,232	3.960
2,000,000	2,000,000	272,362,672	2.772
1,000,000	3,000,000	184,348,800	2.213
400,000	3,600,000	174,504,576	2.054
0	4,000,000	168,334,976	1.971

ArangoDB 2.5, sparse, non-unique hash index

Now let’s compare the above with sparse indexes:

set	not set	index size	time
4,000,000	0	448,000,808	3.617
2,000,000	2,000,000	224,000,424	2.442
1,000,000	3,000,000	112,000,232	2.013
400,000	3,600,000	52,000,136	1.634
0	4,000,000	18,000,088	1.468

As can be seen, index memory usage now decreases much more drastically when the majority of the documents do not contain the index attribute. In the above comparison, the sparse index only uses about a tenth of the memory that the non-sparse index had used when created on a non-existing attribute (last row in each table). If only a tenth of the documents contain the index attribute, the sparse index variant can save about 70 % memory.

Overall, the cost of adding a sparse hash index to a collection with optional attributes can be much lower than for adding a non-sparse index.

Skiplist indexes

Following now are the same comparisons as above, but with using skiplist indexes instead of hash indexes.

ArangoDB 2.4, non-sparse, non-unique skiplist index

set	not set	index size	time
4,000,000	0	320,022,640	32.997
2,000,000	2,000,000	320,014,088	34.437
1,000,000	3,000,000	320,018,256	34.730
400,000	3,600,000	319,971,736	35.772
0	4,000,000	320,007,952	35.691

As can be seen, the memory usage of non-unique skiplist indexes did not correlate to the number of documents that contain the index attribute. The memory usage is almost constant (modulo small fluctuations, as the skiplist is a probabilistic data structure).

ArangoDB 2.5, non-sparse, non-unique skiplist index

Same result for non-sparse skiplists in 2.5:

set	not set	index size	time
4,000,000	0	320,024,944	31.903
2,000,000	2,000,000	320,014,456	33.263
1,000,000	3,000,000	320,005,592	34.777
400,000	3,600,000	320,014,720	34.904
0	4,000,000	320,005,408	35.665

ArangoDB 2.5, sparse, non-unique skiplist index

Now we are getting to the sparse variant of the skiplist index.

set	not set	index size	time
4,000,000	0	320,033,720	31.720
2,000,000	2,000,000	159,988,904	14.734
1,000,000	3,000,000	80,016,112	7.351
400,000	3,600,000	32,006,128	3.695
0	4,000,000	520	1.527

The skiplist index is a structure that grows dynamically, and its memory usage grows almost linearly with the number of index entries. For a non-sparse skiplist index, the number of index entries is equal to the number of documents in the collection. But for a sparse index, all documents will be excluded from the index that do not contain the index attribute. Therefore, in the last row, the sparse skiplist index will use only 520 bytes and its overhead can be considered really low.

When to use sparse indexes

The above figures should have demonstrated that sparse indexes are good candidates when it comes to indexing optional attributes, i.e. attributes that are present in few or the minority of documents in a collection. They can help reduce index and overall memory usage, and can reduce index creation time.

However, nothing is for free, so sparse indexes must have downsides, right?

Right! As sparse indexes may exclude some documents, they cannot be used for every type of query. Sparse hash indexes cannot be used to find documents for which at least one of the indexed attributes contains a value of null. For example, the following AQL query cannot use a sparse index, even if one was created on attribute value:

FOR doc In collection 
  FILTER doc.value == null 
  RETURN doc

If a lookup value in a query is non-constant, a sparse index may or may not be used, depending on the other types of conditions in the query. If the optimizer can safely determine that the lookup value cannot be equal to null, a sparse index may be used. When uncertain, the optimizer will not make use of a sparse index in a query in order to produce correct results.

For example, the following queries cannot use a sparse index on value because the optimizer will not know beforehand whether the comparison values for doc.value will include null:

FOR doc In collection 
  FILTER doc.value == SOME_FUNCTION(...) 
  RETURN doc

FOR other IN otherCollection 
  FOR doc In collection 
    FILTER doc.value == other.value
    RETURN doc

Sparse skiplist indexes can be used for sorting if the optimizer can safely detect that the index range does not include null for any of the index attributes.

How to create sparse indexes

In order to create a sparse index, an object with the sparse attribute can be added to the index creation commands:

db.collection.ensureHashIndex(attributeName, { sparse: true }); 
db.collection.ensureHashIndex(attributeName1, attributeName2, { sparse: true }); 
db.collection.ensureUniqueConstraint(attributeName, { sparse: true }); 
db.collection.ensureUniqueConstraint(attributeName1, attributeName2, { sparse: true }); 

db.collection.ensureSkiplist(attributeName, { sparse: true }); 
db.collection.ensureSkiplist(attributeName1, attributeName2, { sparse: true }); 
db.collection.ensureUniqueSkiplist(attributeName, { sparse: true }); 
db.collection.ensureUniqueSkiplist(attributeName1, attributeName2, { sparse: true });

Note that in place of the above specialized index creation commands, it is recommended to use the more general index creation command ensureIndex:

db.collection.ensureIndex({ type: "hash", sparse: true, unique: true, fields: [ attributeName ] });
db.collection.ensureIndex({ type: "skiplist", sparse: false, unique: false, fields: [ "a", "b" ] });

When not explicitly set, the sparse attribute defaults to false for new indexes.

Geo and fulltext indexes are implicitly sparse, and there is no way and no need to declare them as sparse.

Jan Steemann

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

February 24 2015,Jan Steemann

1 Comments

CoDEmanX on February 26 2015, at 2:01 pm

Regarding the null-value limitation of sparse indices: server-side constraints would come in really handy to ensure there’s no attribute value with an explicit null (could be an empty string instead I guess, or not exist at all in most cases) – like a NEVER_NULL flag for collection attributes.

Reply

Fireside Chat – Powering GenAI: The Critical Foundations for Scale. Watch Now

When and how to use sparse indexes in ArangoDB 2.5

Hash indexes

ArangoDB 2.4, non-sparse, non-unique hash index

ArangoDB 2.5, non-sparse, non-unique hash index

ArangoDB 2.5, sparse, non-unique hash index

Skiplist indexes

ArangoDB 2.4, non-sparse, non-unique skiplist index

ArangoDB 2.5, non-sparse, non-unique skiplist index

ArangoDB 2.5, sparse, non-unique skiplist index

When to use sparse indexes

How to create sparse indexes

Jan Steemann

1 Comments

Leave a Comment Cancel Reply

Tags

Quick Links

Info

About Us

Stay In Touch