SmartJoins: boosting cluster join query performance
The “smart joins” feature available in the ArangoDB Enterprise Edition allows running joins between two sharded collections with performance close to that of a local join operation.
The prerequisite for this is that the two collections have an identical sharding setup.
See our feature video here:
The original Instacart dataset for this video is no longer available but can instead be downloaded from the instacart branch of the interactive tutorials repository.
DATASET: https://github.com/arangodb/interactive_tutorials/tree/instacart
Example setup
Let’s consider an example setup with collections, products and orders. Between these collections there is a 1-to-many relationship, e.g. for each product there can be any number of orders.
To use SmartJoins, the first step is to make the two collections shard their data alike. This can be achieved by making the distributeShardsLike
attribute of the orders collection refer to the products collection:
db._create("products", { numberOfShards: 3, shardKeys: ["_key"] }); db._create("orders", { distributeShardsLike: "products", shardKeys: ["productId"] });
Now both collections will have the same number of shards (3) each. The collection products will be sharded by the values in its _key
attribute, and data in collection orders will be sharded by the values in the attribute productId. That latter attribute productId
is a foreign key referring to the _key
in the products collection.
Now we can populate the collections with some example data:
db.products.insert({ _key: "abc-10001", name: "Hyper Booster" }); db.products.insert({ _key: "hlc-79452", name: "Holographic Horoscope" }); db.products.insert({ _key: "zxf-19414", name: "Nutty Nunchaku" }); db.orders.insert({ productId: "abc-10001", items: 2, date: "2019-02-17" }); db.orders.insert({ productId: "abc-10001", items: 1, date: "2019-03-21" }); db.orders.insert({ productId: "abc-10001", items: 4, date: "2019-03-25" }); db.orders.insert({ productId: "abc-10001", items: 1, date: "2019-03-26" }); db.orders.insert({ productId: "hlc-79452", items: 9, date: "2018-12-31" }); db.orders.insert({ productId: "hlc-79452", items: 1, date: "2019-01-01" }); db.orders.insert({ productId: "zxf-19414", items: 1, date: "2018-10-25" }); db.orders.insert({ productId: "zxf-19414", items: 2, date: "2019-03-12" }); db.orders.insert({ productId: "zxf-19414", items: 7, date: "2019-03-25" });
Data distribution
Due to the identical sharding setup all orders for a product with a given productId
value will be located on the same shard as the products they are referring to, which means joins between the two collections that join them via their shard keys can be executed locally.
Let’s check the actual per-shard data distribution of the collections:
db.products.count(true); { "s2010051" : 1, "s2010049" : 2, "s2010050" : 0 } db.orders.count(true); { "s2010055" : 2, "s2010053" : 7, "s2010054" : 0 }
It means that on the first shard of products (s2010051) a single product is stored. All orders for this product must be located on the corresponding shard of the orders collection, i.e. s2010055. The second shard of products (s2010049) hosts two products, and its counterpart in the orders collection is responsible for 7 orders for these two products. The third shard of the products collection is not responsible for any data yet, and so is its corresponding shard in the orders collection.
Running a smart join query
Before actually executing the first SmartJoin, let’s create a non-unique index on the productId
attribute of the orders collection, so any lookups by product id can be turned into quicker index lookups:
db.orders.ensureIndex({ type: "hash", fields: ["productId"] });
Here is the SmartJoin query, which is just an ordinary join query, with no extra hints added:
FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o
The AQL query optimizer will automatically detect that the join is via the two collections’ shard keys, and that the two collections are sharded equal:
db._explain("FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o"); Query String: FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o Execution plan: Id NodeType Site Est. Comment 1 SingletonNode DBS 1 * ROOT 3 EnumerateCollectionNode DBS 9 - FOR o IN orders /* full collection scan, 3 shard(s) */ 7 IndexNode DBS 0 - FOR p IN products /* primary index scan, scan only, 3 shard(s) */ 10 RemoteNode COOR 0 - REMOTE 11 GatherNode COOR 0 - GATHER 6 ReturnNode COOR 0 - RETURN o Indexes used: By Name Type Collection Unique Sparse Selectivity Fields Ranges 7 primary primary products true false 100.00 % [ `_key` ] (p.`_key` == o.`productId`) Optimization rules applied: Id RuleName 1 interchange-adjacent-enumerations 2 use-indexes 3 remove-filter-covered-by-index 4 remove-unnecessary-calculations-2 5 smart-joins 6 scatter-in-cluster 7 remove-unnecessary-remote-scatter
As can be seen from the above query execution plan, the query optimizer employed the “smart-joins” optimizer rule, which means it has turned a distributed join into a local join. Thus there is no extra hop via the coordinator necessary to join data of the collections in nodes 3 and 7.
Comparison to regular joins
Compare this to the following execution plan, which is from the same query, just without the SmartJoins optimization applied:
db._explain("FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o"); Query String: FOR p IN products FOR o IN orders FILTER p._key == o.productId RETURN o Execution plan: Id NodeType Site Est. Comment 1 SingletonNode DBS 1 * ROOT 16 IndexNode DBS 3 - FOR p IN products /* primary index scan, index only, projections: `_key`, 3 shard(s) */ 14 RemoteNode COOR 3 - REMOTE 15 GatherNode COOR 3 - GATHER 8 ScatterNode COOR 3 - SCATTER 9 RemoteNode DBS 3 - REMOTE 7 IndexNode DBS 3 - FOR o IN orders /* hash index scan, 3 shard(s) */ 10 RemoteNode COOR 3 - REMOTE 11 GatherNode COOR 3 - GATHER 6 ReturnNode COOR 3 - RETURN o Indexes used: By Name Type Collection Unique Sparse Selectivity Fields Ranges 16 primary primary products true false 100.00 % [ `_key` ] * 7 idx_1629018093411893248 hash orders false false 59.52 % [ `productId` ] (p.`_key` == o.`productId`) Optimization rules applied: Id RuleName 1 use-indexes 2 remove-filter-covered-by-index 3 remove-unnecessary-calculations-2 4 scatter-in-cluster 5 remove-unnecessary-remote-scatter 6 reduce-extraction-to-projection
Here we can see that between the two collections (nodes 16 and 7) there is an intermediate hop via the coordinator. This is also the general way of joining two collections with an arbitrary number of shards and unknown data distribution.
Performance comparison
The latter query (the one without SmartJoins) will require 15 network requests for joining the data of the two collections, in the simplest case. The value of 15 stems from these individual requests:
-
- coordinator to shard 1 of orders
- shard 1 of orders to the coordinator
- coordinator to shard 1 of products
- coordinator to shard 2 of products
- coordinator to shard 3 of products
- shard 1 of orders to the coordinator
- coordinator to shard 2 of orders
- shard 2 of orders to the coordinator
- coordinator to shard 1 of products
- coordinator to shard 2 of products
- coordinator to shard 3 of products
- shard 2 of orders to the coordinator
- coordinator to shard 3 of orders
- shard 3 of orders to the coordinator
- coordinator to shard 1 of products
- coordinator to shard 2 of products
- coordinator to shard 3 of products
- shard 3 of orders to the coordinator
- coordinator to shard 1 of orders
The SmartJoins do not need the extra coordinator hop, and do not need to fan out requests from each shard of one collection to each shard of the other collection. SmartJoins use the fact that the join data will always reside locally, so they do not reach out to the other shards via the network.
A lot of requests can be saved here! For the products and orders example query, the SmartJoin variant can get away with just 3 network requests:
- coordinator to shard 1 of orders and products
- coordinator to shard 2 of orders and products
- coordinator to shard 3 of orders and products
Generally spoken, the performance advantage of SmartJoins compared to regular joins will grow with the number of shards of the underlying collections.
For two collections with n
shards each, the minimal number of requests for the general join will be n * (n + 2)
. The number of network requests increases quadratically with the number of shards. SmartJoins can get away with a minimal number of n
requests here, which scales linearly with the number of shards.
SmartJoins will also be especially advantageous for queries that have to ship a lot of data around for performing the join, but that will filter out most of the data after the join. In this case SmartJoins should greatly outperform the general join, as they will eliminate most of the inter-node data shipping overhead.