Three Ways to Scale your Graph
Estimated reading time: 10 minutes
As businesses grow and their data needs increase, they often face the challenge of scaling their database systems to keep up with the increasing demand.
What happens when a single server machine can no longer store a graph that has grown too large? Or when your instance can no longer cope with the increasing number of incoming user requests?
Sharding is the solution.
While you can scale vertically by adding more resources to a single server machine, sharding allows you to scale horizontally by distributing your dataset across multiple machines.
What is Sharding?
Sharding is a technique used in database management systems to horizontally scale and partition data across multiple servers or nodes. The idea behind sharding is to divide a large dataset into smaller subsets called shards and store each shard on a different server. By doing this, the system can process and manage the data more efficiently and effectively.
Sharding is particularly useful when dealing with large datasets that cannot be stored on a single server due to hardware limitations or performance constraints. By dividing the data into smaller chunks and distributing them across multiple servers, sharding can help improve query performance, reduce response times, and enable systems to scale to handle large data volumes.
Advantages of Sharding in ArangoDB
The sharding aspect in ArangoDB offers several advantages, including:
- Improved Performance: By distributing data across multiple servers, ArangoDB can process queries faster, reduce response times and improve overall performance.
- Scalability: Sharding allows ArangoDB to scale horizontally by adding more servers to the cluster. This enables the system to handle larger datasets without compromising performance.
- Fault Tolerance: ArangoDB’s sharding architecture provides fault tolerance by ensuring that data is replicated across multiple servers. This means that if one server fails, the system can continue to function without data loss.
Sharding in ArangoDB is based on shard keys. To determine which shard a document should be stored in, ArangoDB uses a hash-based sharding mechanism. When a document is inserted into a collection, ArangoDB hashes the value of the document’s shard key. The shard key is a user-defined field that specifies how the data should be partitioned. By default this is the _key attribute of a document.
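The idea of hash-based shard assignment can be sketched in a few lines of Python. This is only an illustration of the principle: the function name shard_for is hypothetical, and SHA-256 is used as a stand-in because the hash function ArangoDB actually applies is not described in this post.

```python
import hashlib


def shard_for(shard_key_value: str, number_of_shards: int) -> int:
    """Map a shard key value to a shard index in [0, number_of_shards).

    SHA-256 is an assumption made for this sketch, not ArangoDB's real hash.
    """
    digest = hashlib.sha256(shard_key_value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it modulo the shard count.
    return int.from_bytes(digest[:8], "big") % number_of_shards


# By default the shard key is the _key attribute of a document.
documents = [{"_key": "alice"}, {"_key": "bob"}, {"_key": "charlie"}]
placement = {doc["_key"]: shard_for(doc["_key"], 3) for doc in documents}

for key, shard in placement.items():
    print(f"document {key!r} -> shard {shard}")
```

The same key always maps to the same shard, which is what lets the cluster find a document again without asking every server.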
A graph in ArangoDB is described by a Graph Definition. In short, the Graph Definition consists of a descriptive name and a list of participating collections. Collections, which are the containers for the documents themselves, are either Document Collections or Edge Collections. The data nodes are stored in Document Collections. The relations connecting those nodes are called edges and are stored in Edge Collections.
An edge is a relation that connects two nodes with each other. In ArangoDB an edge is always persisted with a fixed direction, but can be accessed in any direction within graph queries. To get all details and option parameters about Graphs, Vertices, Edges and Graph Definitions in ArangoDB itself, please read our full documentation. In case you want to learn more about graphs in general, please refer to this blog article.
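As a rough illustration, the pieces described above can be modeled with plain Python data. The collection names persons and knows, the graph name social and the helper neighbors are made up for this sketch. The attribute names edgeDefinitions, _from and _to follow ArangoDB's conventions, but the in-memory lookup is not how the database resolves edges.

```python
# A Graph Definition: a descriptive name plus the participating collections.
graph_definition = {
    "name": "social",  # hypothetical graph name
    "edgeDefinitions": [
        {"collection": "knows", "from": ["persons"], "to": ["persons"]},
    ],
}

# Nodes live in a Document Collection, edges in an Edge Collection.
persons = [{"_id": "persons/bob"}, {"_id": "persons/alice"}]
knows = [{"_from": "persons/bob", "_to": "persons/alice"}]


def neighbors(vertex_id, edges, direction="any"):
    """Return neighbor ids, following edges outbound, inbound or in any direction."""
    found = []
    for edge in edges:
        if direction in ("outbound", "any") and edge["_from"] == vertex_id:
            found.append(edge["_to"])
        if direction in ("inbound", "any") and edge["_to"] == vertex_id:
            found.append(edge["_from"])
    return found


# The edge is stored as bob -> alice, yet it can be followed from either end.
print(neighbors("persons/alice", knows, "outbound"))
print(neighbors("persons/alice", knows, "inbound"))
print(neighbors("persons/alice", knows, "any"))
```

The edge is persisted once with a fixed direction. The direction argument only changes how it is read.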
Example Graph Concept
To help you understand the major differences between single server instances and clustered environments, we’ll introduce a simple example graph concept we’ll make use of throughout this blog post.
Our example graph consists only of a limited number of nodes to keep it simple. Those nodes are connected via edges. On a single server instance all of that data would be stored on a single machine.
Challenges with Distributed Graphs
On a single server instance, all data is stored locally. For any graph search, whether it is a Traversal (DFS, BFS or Weighted) or a Shortest Path (Weighted, K-Paths or All-Paths) calculation, we achieve the best performance because we do not need to perform any additional network requests or manage collections that are split into shards across different database servers. This holds until the machine’s capacity is reached.
Therefore we need to organize our graph in such a way that we can scale out to multiple machines. At the same time, we must organize our data in a smart way so that we can run as many computations as possible on the database servers themselves (e.g. to allow parallel computations). This optimization reduces the amount of network communication, and network requests are usually one of the main causes of poor performance.
The better the graph data itself is organized, the better the query performance will be. Better data organization means achieving the best data locality we can get, as this drastically reduces the total number of network requests we need to make, while at the same time we can fan out to multiple machines and continue our graph algorithm there in parallel.
In the upcoming sections we will cover the different types of graph concepts in ArangoDB with a special look at their distribution in a clustered environment and their sharding characteristics.
Sharded Graphs Concepts
ArangoDB offers multiple graph types to create and handle graphs in a clustered environment with different sharding strategies. All types build on the same foundation: Graph Definitions.
The Community Edition of ArangoDB includes:
- General Graph
The Enterprise Edition of ArangoDB adds two major graph types on top:
- EnterpriseGraph
- SmartGraph
(ArangoDB offers even more specialized graph types we’ll dig into in an upcoming blog article.)
All of them support various configuration parameters during creation, e.g.:
- The number of Shards used per Collection, which is automatically applied to every Document or Edge Collection that is part of the Graph Definition. It describes the number of logical partitions (Shards) that a collection will be divided into.
- The Replication Factor, which specifies the number of replicas that should be created for each shard in a database cluster.
- The Write Concern, which specifies the minimum number of database servers or replicas that must acknowledge a write operation before the operation is considered successful.
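To make these three parameters more concrete, here is a minimal sketch that assembles a graph creation request body as a Python dictionary. The helper build_graph_payload and the sample names are hypothetical. The option names numberOfShards, replicationFactor and writeConcern are the ones ArangoDB uses, but the exact payload shape and the validation rules shown here are assumptions that should be checked against the documentation for your version.

```python
import json


def build_graph_payload(name, edge_definitions, number_of_shards,
                        replication_factor, write_concern):
    """Assemble a graph creation body with sharding-related options (sketch)."""
    if number_of_shards < 1:
        raise ValueError("a collection needs at least one shard")
    # Assumption: a write cannot be acknowledged by more replicas than exist.
    if not 1 <= write_concern <= replication_factor:
        raise ValueError("write concern must be between 1 and the replication factor")
    return {
        "name": name,
        "edgeDefinitions": edge_definitions,
        "options": {
            "numberOfShards": number_of_shards,       # logical partitions per collection
            "replicationFactor": replication_factor,  # copies kept of each shard
            "writeConcern": write_concern,            # copies that must acknowledge a write
        },
    }


payload = build_graph_payload(
    name="social",
    edge_definitions=[{"collection": "knows", "from": ["persons"], "to": ["persons"]}],
    number_of_shards=3,
    replication_factor=2,
    write_concern=2,
)
print(json.dumps(payload, indent=2))
```

With these sample values, every collection in the graph is split into three shards, each shard is kept twice, and a write only succeeds once both copies have acknowledged it.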
The General Graph is the basic graph type in ArangoDB that allows you to create and manage graphs. It is suitable for small-scale graph use cases and does not require any specific configuration or setup. In a General Graph, the data is distributed randomly across all configured machines, with each machine taking a roughly equal portion of the data. It is very easy to set up, as no knowledge about the data is required.
This graph type will always work, but the approach has a few disadvantages. Because the distribution is random, neighbors will land on different machines, and edges will, with high probability, end up on different servers than the nodes they connect. In the worst case, for a relation (Bob) ⇒ (Alice), Bob lands on database server A, the relation itself on database server B and Alice on database server C. Following that single relation then requires several network requests, which shows up as poor query execution performance.
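A small simulation shows how rarely a relation stays on one machine under random placement. Everything here is an assumption made for illustration: three servers, a ring of 1000 made-up vertices, and SHA-256 over the full document id as a stand-in for the real placement logic.

```python
import hashlib

NUMBER_OF_SERVERS = 3  # assumed cluster size for this sketch


def server_for(document_id: str, number_of_servers: int = NUMBER_OF_SERVERS) -> int:
    """Pseudo-randomly assign a document to a server (stand-in hash)."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % number_of_servers


def servers_touched(edge: dict) -> set:
    """Servers involved in following one relation: source, edge and target."""
    return {
        server_for(edge["_from"]),
        server_for(edge["_id"]),
        server_for(edge["_to"]),
    }


# A ring of 1000 vertices, each connected to its successor.
edges = [
    {"_id": f"knows/{i}", "_from": f"persons/{i}", "_to": f"persons/{(i + 1) % 1000}"}
    for i in range(1000)
]

local_edges = sum(1 for edge in edges if len(servers_touched(edge)) == 1)
local_fraction = local_edges / len(edges)
print(f"relations fully on one server: {local_fraction:.1%}")
```

With three servers and independent, uniform placement, only about one relation in nine is expected to have both vertices and the edge on the same server. Every other relation needs at least one network hop.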
The EnterpriseGraph is a more advanced type of graph in ArangoDB that is designed to support large-scale graph use cases in enterprise environments. EnterpriseGraphs will allow you to create graphs at scale with automated sharding key selection.
While the data itself is also “randomly sharded” (like in General Graphs), this specific graph type ensures that all edges adjacent to a vertex are co-located on the same server. This approach provides significant advantages, as it minimizes the impact of suboptimal sharding keys chosen when creating the graph. It gives a substantial performance benefit for all graphs sharded in an ArangoDB Cluster, reducing network hops considerably because more graph calculations can be done on the database nodes themselves.
The only consequence is that you cannot define custom _key values on edges, because ArangoDB calculates the value based on the shard key, and the EnterpriseGraph graph type performs this calculation automatically. EnterpriseGraphs also store some additional meta information, which leads to slightly higher disk usage compared to General Graphs.
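The effect of co-locating edges with their vertices can be sketched as follows. This is a deliberate simplification and not the EnterpriseGraph implementation: it only models outgoing edges by storing each edge on the server of its source vertex, and it reuses a SHA-256 stand-in for placement. How ArangoDB actually derives edge keys and placement is not covered in this post.

```python
import hashlib

NUMBER_OF_SERVERS = 3  # assumed cluster size for this sketch


def server_for(document_id: str) -> int:
    """Pseudo-randomly assign a document to a server (stand-in hash)."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUMBER_OF_SERVERS


def servers_random(edge: dict) -> set:
    """General Graph style: the edge is placed independently of its vertices."""
    return {server_for(edge["_from"]), server_for(edge["_id"]), server_for(edge["_to"])}


def servers_colocated(edge: dict) -> set:
    """Simplified co-location: the edge lives wherever its source vertex lives."""
    edge_server = server_for(edge["_from"])
    return {server_for(edge["_from"]), edge_server, server_for(edge["_to"])}


edges = [
    {"_id": f"knows/{i}", "_from": f"persons/{i}", "_to": f"persons/{(i + 1) % 1000}"}
    for i in range(1000)
]

avg_random = sum(len(servers_random(e)) for e in edges) / len(edges)
avg_colocated = sum(len(servers_colocated(e)) for e in edges) / len(edges)
print(f"average servers per relation, random placement: {avg_random:.2f}")
print(f"average servers per relation, co-located edges: {avg_colocated:.2f}")
```

In this model a relation never involves more than two servers, and stepping from a vertex to its outgoing edges never leaves the machine.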
While EnterpriseGraphs already improve the data distribution across multiple database servers, SmartGraphs optimize data distribution even further. A graph by itself carries no knowledge about its own structure, but your application knows a lot about it. In many datasets there are highly interconnected communities, with only a few connections between these communities. For instance, a grouping by customer, region or any other logic you apply to organize your graph at the application layer can in turn be used to shard the graph across the cluster.
Think about a social network. In a social network, which connects people through relations, a person is more likely to have friends, relatives or followers in their own region. From this knowledge, we can choose the ideal property to define the graph’s sharding. This property is called the smartGraphAttribute and needs to be defined for a SmartGraph in ArangoDB. As a result, highly connected communities land on the same database server, which improves query execution times even further compared to EnterpriseGraphs. A simple performance comparison between a General Graph and a SmartGraph example can be found here.
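The social network example can be sketched like this. The attribute name region, the sample people and the helper functions are assumptions for illustration, and the SHA-256 placement is a stand-in. The point is only that sharding by a shared attribute value keeps a community together.

```python
import hashlib

NUMBER_OF_SERVERS = 3  # assumed cluster size for this sketch


def server_for(value: str) -> int:
    """Pseudo-randomly map a value to a server (stand-in hash)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUMBER_OF_SERVERS


# Hypothetical vertices; "region" plays the role of the smartGraphAttribute.
people = {
    "persons/bob": {"region": "DE"},
    "persons/alice": {"region": "DE"},
    "persons/carol": {"region": "US"},
}
edges = [
    {"_from": "persons/bob", "_to": "persons/alice"},    # inside the DE community
    {"_from": "persons/alice", "_to": "persons/carol"},  # crosses communities
]


def smart_server(vertex_id: str, smart_graph_attribute: str = "region") -> int:
    """Place a vertex by its smartGraphAttribute value instead of its own key."""
    return server_for(str(people[vertex_id][smart_graph_attribute]))


def is_local(edge: dict) -> bool:
    """True if both endpoints of the edge live on the same server."""
    return smart_server(edge["_from"]) == smart_server(edge["_to"])


for edge in edges:
    print(edge["_from"], "->", edge["_to"], "local:", is_local(edge))
```

Bob and Alice share a region, so they are guaranteed to land on the same server and the relation between them can be followed without a network hop. Only relations that cross communities may need one.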
The storage size of SmartGraphs is comparable to the storage size of EnterpriseGraphs. At most it will be equal to the size of an EnterpriseGraph. At best it is close to the resource consumption of a General Graph.
All the graph types ArangoDB offers have several advantages and disadvantages. All of them share the ability to scale your graph to multiple machines in a clustered environment. You can always start exploring using the General Graph.
If the performance does not fit your needs, try an EnterpriseGraph, as you do not need to take care of the graph data distribution manually. You should see better query execution times right away.
If your requirements are still not met, you should start to organize your graph dataset in a smarter way. While you should always think about an ideal way of modeling your graph, the next graph type, the SmartGraph, requires you to define a property for it. Think about a good way to structure your graph into multiple groups with the primary goal of using SmartGraphs, which can then drastically improve performance because the data can be organized across all database servers in the best way possible.
Therefore, think carefully about which graph type best fits the schema of your graph and its dataset. Ideally, you choose the best-fitting graph type for your use case in advance, as converting to another graph type might cause extra effort (find details about graph migration here).
Want to learn even more about advanced graph sharding strategies? Stay connected. A more advanced blog post will follow soon about OneShard Graphs, Disjoint SmartGraphs, Hybrid SmartGraphs, Hybrid EnterpriseGraphs and SatelliteGraphs, which allow you to take graph sharding fine-tuning to the next level.
Call to Action
Thank you for reading!
Feel free to connect and ask us any questions @ https://www.arangodb.com/community/.
Also, don’t forget to check out the free trial of our cloud offering, ArangoGraph Insights Platform: https://www.arangodb.com/download/#try-cloud