Sort-Limit Optimization in AQL

March 7 2019,/General

Sometimes we want sorted output from a query and, for whatever reason, cannot use an index to do the sorting. In ArangoDB, we already cover this critical case with finely tuned query execution code. Sometimes though, we do not need to return all output, and follow our SORT clause with a LIMIT. In ArangoDB 3.4 and earlier, we did not handle this case any differently from returning the full data, at least with respect to sorting – we would sort the full input, then apply the limit afterwards.

In many cases though, it is possible to maintain just a reduced set of data in memory, equivalent in size to that required by the LIMIT clause, using a special data structure called a size-constrained heap. We have just added the “sort-limit” optimizer rule in our development branch, for our eventual 3.5 release, to dramatically reduce memory usage and optimize performance in these special cases.

Example Sort-Limit query

Consider the following simple example.

FOR doc IN records
  SORT doc.value
  LIMIT 100, 10
  RETURN doc

Suppose that the records collection holds millions of documents, and we do not have an index over the value attribute. Without the “sort-limit” optimization, we would read all documents from records into main memory, sort them, then apply the limit to skip the first 100 and return the next 10. That seems pretty wasteful! Instead, suppose we modify our scanning process. As we read all the documents from records, we really just need to keep track of first 110 documents in sort order. We can accomplish this with the size-constrained heap data structure. Once we have those 110 documents, we can simply sort them as usual, and apply the limit/offset to this reduced data set.

This saves us a great deal of memory usage, and can in fact save us quite a bit of time as well. Suppose we run essentially the basic query above with a few different configurations as a benchmark.

Performance improvement results

First, let’s set up our collection with the following query:

FOR i IN 1..@collectionSize
  INSERT {value: RAND()}
  INTO records

Then let’s test the lookup performance with this query:

FOR doc IN records
  SORT doc.value
  LIMIT @limit
  RETURN doc

Tested on a developer laptop with a 7th-gen Core i7, we get the following results.

Note that as the number of items retained approaches the total number of input elements, the speedup drops, but we still get considerable memory savings. As the number gets even closer to the input, we may actually get some small slowdown. In the cases where the memory savings will still be significant, we the optimizer may choose to enable the “sort-limit” rule even if there will be a small slowdown. If the memory savings will be trivial, however, the optimizer will not apply the “sort-limit” rule, in order to avoid such a slowdown.

If you experience an undesired slowdown on a particular query, and do not care about the memory savings, you may choose to manually override the optimizer decision by disabling the “sort-limit” rule.

Feel free to check out the optimization for yourself with one of our recent nightly Docker builds, or look forward to improved performance with our 3.5 release!

Jan Steemann

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).

March 7 2019,Jan Steemann

Fireside Chat – Powering GenAI: The Critical Foundations for Scale. Watch Now

Sort-Limit Optimization in AQL

Example Sort-Limit query

Performance improvement results

Jan Steemann

Leave a Comment Cancel Reply

Tags

Quick Links

Info

About Us

Stay In Touch