SmartJoins - Making distributed Join Operations Local
When sharding data collections to a cluster, running JOIN operations across data residing on different machines might require large amounts of data movements and network traffic and hence suboptimal performance.
SmartJoins are a solution for running fast distributed JOIN operations against sharded collections by utilizing a smart sharding scheme allowing JOIN operations with minimal network traffic.
By using SmartJoins, data processing is brought to the servers where the data actually resides – the DBservers in ArangoDB – and execute the operation in parallel & locally. This approach allows to achieve the performance close to a single instance in a cluster setting. For data warehousing, reporting and business intelligence use cases SmartJoins can be a perfect solution for gaining insights from huge datasets.
How to benefit from SmartJoins
To leverage SmartJoins, Enterprise Edition users just have to shard two collections with the same sharding setup.
By having the same number of shards and using the _key
attribute by which join operations are performed as sharding key, the ArangoDB SmartJoins feature makes sure that data needed for JOIN operations reside on the same machine and can, therefore, be executed locally.
A good example are product orders which include e.g. a products
and an orders
collections. The _key attribute in products
can be used to shard the products collection. Within the orders
collection the productId
is a foreign key referencing to the _key
attribute in the products
collection and by using this attribute as the sharing key for the orders
collection guarantees that orders and products are sharded alike and SmartJoins can be used.
db._create("products", { numberOfShards: 3, shardKeys: ["_key"] }); db._create("orders", { distributeShardsLike: "products", shardKeys: ["productId"] });
After sharding collections with the SmartJoin approach, the query engine on the Coordinator automatically sends incoming queries to query engines on the DBservers for local query execution. This happens in parallel, when multiple shards are needed for this particular query. The intermediate results from the DBservers get then send to the Coordinator, where the final result set gets put together and send to the client (see schema below).
Collections Sharded with SmartJoins Approach in ArangoDB. Note that both the M and B collections are sharded based on the same join key and hence joins between them can be performed locally.
For taking a closer look at SmartJoins, please see the hands-on tutorial or see more details in the SmartJoin docs.
Enhanced Distributed JOIN Capabilities with SmartJoins and SatelliteCollections
SatelliteCollections is another feature in ArangoDB to accelerate distributed JOIN operations. You can define a collection to be sharded and other smaller collections can be replicated to each machine for allowing local join operations among the shards and replicated collections.
But SatelliteCollections can also be combined with SmartJoins to allow enhanced join operations spanning two sharded collections and multiple replicated Satellites.
Imagine we would add a user
collection to the product purchase example outlined above. This user collection is typically much smaller compared to the product
or order
collections and hence would be a good candidate to be replicated to each DB server. In such cases, joins across all three collections could be performed locally thanks to the combination of SmartJoins (product and order
collections) and SatelliteCollections (user
collections).
Check out our SmartJoins video tutorial below to see how it works and how to combine both features for optimal performance.