Efficient Data Collection with Hash Tables: ArangoDB Insights
ArangoDB 2.6 will feature an alternative hash implementation of the AQL COLLECT
operation. The new implementation can speed up some AQL queries that can not exploit indexes on the COLLECT
group criteria.
This blog post provides a preview of the feature and shows some nice performance improvements. It also explains the COLLECT
-related optimizer parts and how the optimizer will decide whether to use the new or the traditional implementation.
Introduction to COLLECT
A quick recap: in AQL, the COLLECT
operation can be used for grouping and optionally counting values.
Here’s an example, using flight data:
FOR flight IN flights
COLLECT from = flight._from WITH COUNT INTO count
RETURN { from: from, count: count }
This query will iterate over all documents in collection flights
, and count the number of flights per different _from
value (origin airport). The query result will contain only unique from
values plus a counter for each:
[
{ "from" : "airports/ABE", "count" : 6205 },
{ "from" : "airports/ABQ", "count" : 39346 },
{ "from" : "airports/ACV", "count" : 362 },
...
{ "from" : "airports/YAP", "count" : 285 },
{ "from" : "airports/YKM", "count" : 879 },
{ "from" : "airports/YUM", "count" : 2275 }
]
As the COLLECT
will group its result according to the specified group criteria (flights._from
in the above query), it needs a way of figuring out to which group any input value does belong.
Before ArangoDB 2.6, there was a single method for determining the group. Starting with ArangoDB 2.6, the query optimizer can choose between two different COLLECT
methods, the sorted method and the hash method.
Sorted COLLECT method
The traditional method for determining the group values is the sorted method. It has been available in ArangoDB since the very start.
The sorted method of COLLECT
requires its input to be sorted by the group criteria specified in the COLLECT
statement. Because there is no guarantee that the input data are already sorted in the same way, the query optimizer will automatically insert a SORT
statement into the query in front of the COLLECT
. In case there is a sorted index present on the group criteria attributes, the optimizer may be able to optimize away the SORT
again. If there is no sorted index present on the group criteria attributes, the SORT
will remain in the execution plan.
Here is the execution plan for the above query using the sorted method of COLLECT
. We can see the extra SortNode
with id #7 being added by the optimizer in front of the COLLECT
:
The sorted method of COLLECT
is efficient because it can write out a group result whenever an input value will start a new group. Therefore it does not need to keep the whole COLLECT
result in memory. The downside of using the sorted method is that it requires its input to be sorted, and that this requires adding an extra SORT
for not properly sorted input.
Hash COLLECT method
Since ArangoDB 2.6, the query optimizer can also employ the hash method for COLLECT
. The hash method works by assigning the input values of the COLLECT
to slots in a hash table. It does not require its input to be sorted. Because the entries in the hash table do not have a particular order, the query optimizer will add a post-COLLECT
SORT
statement. With this extra sort of the COLLECT
result, the optimizer ensures that the output of the sorted COLLECT
will be the same as the output of the hash COLLECT
.
Here is the execution plan for the above query when using the hash method of COLLECT
. Here we can see the extra SortNode
with id #7 being added post-COLLECT
:
The hash method is beneficial because it does not require sorted input and thus no extra SORT
step in front. However, as the input is not sorted, it is never clear when a group is actually finished. The hash method therefore needs to build the whole COLLECT
result in memory until the input is exhausted. Then it can safely write out all group results. Additionally, the result of the hash COLLECT
is unsorted. Therefore the optimizer will add a post-COLLECT
sort to ensure the result will be identical to a sorted COLLECT
.
Which method will be used when?
The query optimizer will always take the initial query plan and specialize its COLLECT
nodes to using the sorted method. It will also add the pre-COLLECT
SORT
in the original plan.
In addition, for every COLLECT
statement not using an INTO
clause, the optimizer will create a plan variant that uses the hash method. In that plan variant, the post-COLLECT
SORT
will be added. Note that a WITH COUNT INTO
is still ok here, but that using a regular INTO
clause will disable the usage of the hash method:
FOR flight IN flights
COLLECT from = flight._from INTO allFlights
RETURN { from: from, flights: allFlights }
If more than one COLLECT
method can be used for a query, the created plans will be shipped through the regular optimization pipeline. In the end, the optimizer will pick the plan with the lowest estimated total cost as it will do for all other queries.
The hash variant does not require an up-front sort of the COLLECT
input, and will thus be preferred over the sorted method if the optimizer estimates many input elements for the COLLECT
and cannot use an index to process them in already sorted order. In this case, the optimizer will estimate that post-sorting the result of the hash COLLECT
will be more efficient than pre-sorting the input for the sorted COLLECT
.
The main assumption behind this estimation is that the result of any COLLECT
statement will contain at most as many elements as there are input elements to it. Therefore, the output of a COLLECT
is likely to be smaller (in terms of rows) than its input, making post-sorting more efficient than pre-sorting.
If there is a sorted index on the COLLECT
group criteria that the optimizer can exploit, the optimizer will pick the sorted method because thanks to the index it can optimize away the pre-COLLECT
sort, leaving no sorts left in the final execution plan.
To override the optimizer decision, COLLECT
statements now have an OPTIONS
modifier. This modifier can be used to force the optimizer to use the sorted variant:
FOR flight IN flights
COLLECT from = flight._from WITH COUNT INTO count OPTIONS { method: "sorted" }
RETURN { from: from, count: count }
Note that specifying hash in method
will not force the optimizer to use the hash method. The reason is that the hash variant cannot be used for all queries (only COLLECT
statements without an INTO
clause are eligible). If OPTIONS
are omitted or any other method than sorted
is specified, the optimizer will ignore it and use its regular cost estimations.
Understanding execution plans
Which method is actually used in a query can found out by explaining it and looking at its execution plan.
A COLLECT
is internally handled by an object called AggregateNode
, so we have to look for that. In the above screenshots, the AggregateNode
s are tagged with either hash or sorted. This can also be checked programatically by looking at the aggregationOptions.method
attributes in the JSON result of an explain().
Here is some example code to extract this information, limited to the AggregateNode
s of the query already:
var query = `
FOR flight IN flights
COLLECT from = flight._from WITH COUNT INTO count
RETURN { from: from, count: count }
`;
var stmt = db._createStatement(query);
var plan = stmt.explain().plan;
plan.nodes.filter(function(node) {
return node.type === 'AggregateNode';
});
For the above query, this will produce something like this:
[
{
"type" : "AggregateNode",
...
"aggregationOptions" : {
"method" : "hash"
}
}
]
Here we can see that the query is using the hash method.
Optimizing away post-COLLECT sorts
If a query uses the hash method for a COLLECT
but the sort order of the COLLECT
result is irrelevant to the user, the user can provide a hint to the optimizer to remove the post-COLLECT
sort.
This can be achieved by simplying appending a SORT null
to the original COLLECT
statement. Here we can see that this removes the post-COLLECT
sort:
Performance improvements
The improvements achievable by using the hash method instead of the sorted method obviously depend on whether there are appropriate indexes present for the group criteria. If an index can be exploited, the sorted method may be just fine. However, there are cases when no indexes are present, for example, when running arbitrary ad-hoc queries or when indexes are too expensive (indexes need to be updated on insert/update/remove and also will use memory).
Following are a few comparisons of the sorted and the hash methods in case no indexes can be used.
Here’s the setup for the test data. This generates 1M documents with both unique and repeating string and numeric values. For the non-unique values, we’ll use 20 different categories:
var test = db._create("test");
for (var i = 0; i < 1000000; ++i) {
test.insert({
uniqueNumber: i,
uniqueString: String("test" + i),
repeatingNumber: (i % 20),
repeatingString: String("test" + (i % 20))
});
}
Now let’s run the following query on the data and measure its execution time:
FOR v IN test
COLLECT value = v.@attribute WITH COUNT INTO count
RETURN { value: value, count: count }
The worst case is when the COLLECT
will produce as many output rows as there are input rows. This will happen when using a unique attribute as the grouping criterion. We’ll run tests on both numeric and string values.
Here are the execution times for unique inputs. It can be seen that the hash method here will be beneficial if the post-COLLECT
sort can be optimized away. As demonstrated above, this can be achieved by adding an extra SORT null
after the COLLECT
statement. If the post-COLLECT
sort is not optimized away, it will make the hash method a bit more expensive than the sorted method:
collect method @attribute duration
-------------------------------------------------------
sorted uniqueNumber 11.92 s
hash uniqueNumber 13.40 s
hash (sort null) uniqueNumber 10.13 s
sorted uniqueString 22.04 s
hash uniqueString 27.35 s
hash (sort null) uniqueString 12.12 s
Now let’s check the results when we group on an attribute that is non-unique. Following are the results for numeric and string attributes with 20 different categories each:
collect method @attribute duration
-------------------------------------------------------
sorted repeatingNumber 5.56 s
hash repeatingNumber 0.94 s
hash (sort null) repeatingNumber 0.94 s
sorted repeatingString 10.56 s
hash repeatingString 1.09 s
hash (sort null) repeatingString 1.09 s
In these cases, the result of the COLLECT
will be much smaller than its input (we’ll only get 20 result rows out instead of 1M). Therefore the post-COLLECT
sort for the hash method will not make any difference, but the pre-COLLECT
sort for the sorted method will still need to sort 1M input values. This is also the reason why the hash method is significantly faster here.
As usual, your mileage may vary, so please run your own tests.
7 Comments
Leave a Comment
Get the latest tutorials, blog posts and news:
Will there be a performance gain if you use WITH COUNT INTO and sort by count on repeating numbers and strings?
It is definitely possible though not guaranteed. I’ll try to explain when to expect a speedup and when not.
I understood the goal is to sort by the count values. Then the COLLECT will require a post-SORT by count in the AQL query. This SORT cannot be optimized away and it cannot use any index, because its input values are generated during the grouping phase of the `COLLECT`.
However, if the input values to the COLLECT repeat a lot (as you mentioned), the output of the COLLECT should be much smaller than its input. And the smaller the output of the COLLECT is, the less expensive a SORT on it will become.
That means if your query is something like `FOR doc IN collection COLLECT value = doc.value WITH COUNT INTO count SORT count DESC RETURN { value: value, count: count }` and there are many repeating values in `doc.value` and no sorted index on this attribute, the query can employ the hash table collect and run faster than before, when it required the input of the COLLECT to be sorted by `doc.value`. This sorting would have been the most expensive part of the query, and the goal of the hash table collect is to avoid it. So yes, here you should see a speedup.
When there is already a sorted index on the group criterion (i.e. `doc.value` in the previous query), then the optimizer will likely use it to avoid the pre-COLLECT sorting altogether. In this case, it won’t use the hash table collect and query execution time shouldn’t change too much (at least not directly related to COLLECT and SORT, we have applied some other optimizations elsewhere that might still speed it up).
Finally, if after a COLLECT there are as much output values as there are input values, then sorting by the final count values will still be expensive (though unavoidable). But you might still save the pre-COLLECT sort due to the optimizer using the hash table version of COLLECT if there is no index on the group criterion.
Thanks Jan for the in-depth explanation! I should definitely see an improved performance in my case, because there were no indexes at all.
I aggregated all fields taken from a RDBMS, so flat documents with a limited number of fields per collection, but without indexes, since the generation of them would had taken a lot of time and memory (and a script to set them all up, >10k). The time consumption was still acceptable, but should be a matter of seconds instead of hours now.
Fingers crossed! Please let us know if performance of your queries improves with ArangoDB 2.6. Thanks!
Will there be a 2.6 beta build for windows anytime soon? I’d like to test it sooner than later, but I’m a bit afraid of the windows compilation process. Linux builds are way easier, but would require the source data to be available inside the vm if I’m not mistaken.
I just re-ran my script to aggregate all fields with a minor change to the AQL query to utilize WITH COUNT INTO and on a slightly different data source (negligible though), and it’s 120x faster!!! (from 8 hours down to 4 minutes). I did/do not use indexes, since I would need them on every attribute on all hundred collections. This is an awesome improvement. Thanks, Jan!
Thanks for the feedback. A 120x improvement looks good! Props also go to @weinberger for suggesting the hash collect method at all!