Benchmark Results – ArangoDB vs. Neo4j: ArangoDB up to 8x faster than Neo4j
Introduction
This document presents benchmark results comparing ArangoDB's Graph Analytics Engine (GAE) against Neo4j. The GAE is one component of ArangoDB's Data Science Suite.
This reproducible benchmark aims to provide a neutral and thorough comparison between the two databases, ensuring a fair and unbiased assessment.
We use the wiki-Talk dataset, a widely used, real-world graph dataset derived from the edit and discussion history of Wikipedia.
The wiki-Talk dataset encapsulates communication patterns between Wikipedia users, specifically interactions on user talk pages. This dataset is used frequently in benchmarking graph databases and graph analytics systems because of its unique characteristics. The key characteristics that make wiki-Talk a highly reliable benchmarking dataset include its directed structure, its realistic number of nodes and edges, its scale, its temporal dimension, and its sparsity.
The results demonstrate the efficiency and scalability of each database, and offer a representative benchmark model for organizations evaluating graph databases for their needs.
Benchmark Highlights
The benchmark results reveal several notable insights, particularly highlighting ArangoDB's superior performance in graph analytics tasks compared to Neo4j. Most strikingly:
- ArangoDB consistently outperformed Neo4j across various graph computation algorithms, with performance improvements ranging from 1.3 times to over 8 times faster.
- This substantial speed advantage extends to graph loading as well: ArangoDB loaded the wiki-Talk graph roughly 1.8 times faster than Neo4j.
ArangoDB's optimized data storage and retrieval, combined with its efficient query execution, also contributed significantly to its superior performance in these scenarios.
These findings underscore:
- ArangoDB's capability to power larger-scale, lower-latency real-time graph analytics applications.
- ArangoDB as a much more compelling choice for industries and organizations that require rapid data processing and analysis, such as real-time recommendation systems, social network analysis, fraud detection, and cyber security.
Benchmark Overview
Datasets (wiki-Talk)
We utilized the wiki-Talk dataset, a well-regarded dataset for evaluating graph database performance. The chosen graphs and their details are as follows:
Graphs Used | Vertices | Edges |
---|---|---|
wiki-Talk | 2,394,385 | 5,021,410 |
Hardware
All tests were conducted on the same machine with the following specifications:
OS Ubuntu 23.10 (64-bit)
Memory 192 GB (4800 MHz)
CPU Ryzen 9 7950X3D (16 Cores, 32 Threads)
Database Configuration
***Neo4j***
Version 5.19.0 (Community Edition)
Deployment On-Premise, Single Process
***ArangoDB***
Version 3.12.0-NIGHTLY.20240305 (Community Edition)
Deployment On-Premise, Single Process
Graph Analytics Engine (GAE)
Version Latest
Deployment On-Premise, Single Process (Rust-based, no multithreading)
Benchmark Configuration
Two workflows were used to measure performance:
Workflow A:
- Create the in-memory representation
- Execute each algorithm once
- Measure the whole process
Workflow B:
- Create the in-memory representation
- Measure graph creation time
- Execute each algorithm individually
- Measure computation time
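As a rough sketch, the timing logic of Workflow B can be expressed as wall-clock measurements around the relevant HTTP calls. Note that the endpoint paths and the GAE_URL below are hypothetical placeholders, not the actual GAE API:

```shell
# Hypothetical sketch of Workflow B's timing: measure graph creation and
# computation separately. GAE_URL and endpoint paths are placeholders.
GAE_URL="http://localhost:9999"

start=$(date +%s%N)
# curl -s -X POST "$GAE_URL/loadgraph" -d '{"graph":"wiki-Talk"}'   # load step (placeholder)
end=$(date +%s%N)
echo "graph creation: $(( (end - start) / 1000000 )) ms"

start=$(date +%s%N)
# curl -s -X POST "$GAE_URL/pagerank"                               # compute step (placeholder)
end=$(date +%s%N)
echo "pagerank computation: $(( (end - start) / 1000000 )) ms"
```

The actual benchmark harness uses Vitest with tinybench (see "Used Technologies" below), which applies the same principle with proper warm-up and repetition.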
Algorithms Tested
- PageRank
- Weakly Connected Components (WCC)
- Strongly Connected Components (SCC)
- Label Propagation
Used Technologies
- JavaScript Framework: Vitest with tinybench
- Communication
- Neo4j: Official Neo4j JS driver ("neo4j-driver": "^5.18.0")
- GAE: Plain HTTP(S) requests using Axios ("axios": "^1.6.8")
Benchmark Results
Graph Loading (wiki-Talk)
Task | GAE (sec) | Neo4j (sec) | Times Faster |
---|---|---|---|
Load graph wiki-Talk | 9.9 | 18 | 1.8 x |
Load Graph wiki-Talk with Attributes | 10.7 | 19.2 | 1.8 x |

Graph Computation (wiki-Talk)
Task | GAE (sec) | Neo4j (sec) | Times Faster |
---|---|---|---|
Compute PageRank | 3.8 | 10.6 | 2.8 x |
Compute WCC | 2.3 | 4.5 | 2.0 x |
Compute SCC | 3.2 | 6.7 | 2.1 x |
Compute Label Propagation | 1.5 | 13 | 8.5 x |
Explanation of Elements
Graph Algorithms
- PageRank: an algorithm that ranks nodes in a graph based on their connections; it is also commonly used in search engines.
- Weakly Connected Components (WCC): identifies subsets of a graph in which any two vertices are connected by a path, ignoring edge direction.
- Strongly Connected Components (SCC): identifies subsets of a graph in which every vertex is reachable from every other vertex within the same subset.
- Label Propagation: a semi-supervised learning algorithm for community detection in graphs, in which nodes iteratively propagate their labels to their neighbors.
Reasons for ArangoDB’s Superior Performance
Several factors contribute to ArangoDB's superior performance:
The performance of ArangoDB on the wiki-Talk dataset is attributable to specific architectural optimizations rather than raw computational power. In this scenario, ArangoDB serves as the data storage system, while the computation is handled by the Graph Analytics Engine (GAE). The benchmark focuses on two key stages:
- Loading the data into the GAE
- Computation of algorithms within the GAE
Graph Loading Times
ArangoDB Side
ArangoDB’s graph loading times are optimized due to two primary factors:
- Parallel Data Extraction: ArangoDB's support for parallel data loading from both single-server and distributed deployments is a major reason for its data-loading performance advantage. This capability lets teams scale to multiple machines, where increased parallelism yields faster data transfer. By enabling efficient horizontal scaling, the system achieves significant performance improvements over approaches limited to sequential extraction.
- Projections for Targeted Data Transfer: Projections allow ArangoDB to transmit only the data attributes required for analysis. If only edge IDs and a single attribute are needed, the system extracts and transfers just these fields, avoiding the overhead of transmitting entire documents. This reduces both the data volume and the network latency during graph loading operations.
Graph Analytics Engine (GAE) Side
The GAE is written in Rust and processes the transferred data with high efficiency:
- Efficient Data Representation
The GAE stores graph data within highly optimized in-memory structures, reducing memory usage while at the same time maintaining extremely fast access speeds. Graphs are immediately ready for computation without unnecessary delays.
Advantages in the Workflow
These features deliver several tangible benefits, as shown during the benchmark:
- Fast and Parallel Data Extraction - Parallelism improves speed and scalability.
- Optimized Data Transfer with Projections - Only the required data is transmitted, minimizing overhead.
- Compact and Efficient In-Memory Representation in the GAE - High-performance graph computation with minimal memory footprint.
Clarifying the Benchmark Scope
It is important to note that the benchmark does not evaluate data insertion times into ArangoDB or computational tasks performed by ArangoDB itself. Instead, it assesses the efficiency of:
- Loading graph data from ArangoDB into the GAE.
- The GAE's ability to compute graph algorithms.
By highlighting these stages, the benchmark shows the advantages of ArangoDB’s design in supporting large-scale graph workflows through fast data loading and efficient interaction with the GAE.
Reproducibility of the Benchmark
This benchmark is fully reproducible, ensuring consistent and verifiable results. The results reflect ArangoDB's implementation under the precise specifications and configurations mentioned above. We encourage organizations to replicate the benchmark and verify the results. To do so, follow these steps:
- First, set up the hardware environment with an Ubuntu 23.10 operating system, 192 GB of memory, and a Ryzen 9 7950X3D CPU.
- Install and configure Neo4j and ArangoDB at the versions listed above using the provided Docker configurations. Use single-threaded (non-clustered) configurations for both.
- Next, utilize the wiki-Talk dataset for testing. Execute the specified graph algorithms (PageRank, WCC, SCC, Label Propagation) using the detailed workflows (A and B) outlined in the benchmark configuration above.
- Measure the in-memory graph creation and computation times, and compare the results for both databases. This method ensures that the benchmark can be reliably reproduced in different environments.
PLEASE NOTE: This benchmark requires the installation of the ArangoDB Graph Analytics Engine (GAE). As this code is not open source, please reach out to Corey Sommers at corey.sommers@arangodb.com to receive access to the GAE for the purposes of reproducing this benchmark in your environment (to ensure objectivity of results).
Conclusion
The benchmark results clearly demonstrate ArangoDB's superior performance over Neo4j in both graph computation and graph loading tasks. ArangoDB's significant speed advantages - particularly its ability to execute complex algorithms and load large datasets much faster - highlight its optimized architecture and efficient data handling.
These findings make ArangoDB a compelling choice for applications requiring high-performance graph analytics and real-time data processing.
Vector Search in ArangoDB: Practical Insights and Hands-On Examples
Estimated reading time: 5 minutes
Vector search is gaining traction as a go-to tool for handling large, unstructured datasets like text, images, and audio. It works by comparing vector embeddings, numerical representations generated by machine learning models, to find items with similar properties. With the integration of Facebook’s FAISS library, ArangoDB brings scalable, high-performance vector search directly into its core, accessible via AQL (ArangoDB Query Language). Vector Search is now just another fully integrated data type/model in ArangoDB’s multi-model approach. The Vector Search capability is currently in Developer Preview and will reach production release in Q1 2025.
This guide will walk you through setting up vector search, combining it with graph traversal for advanced use cases, and using tools like LangChain to power natural language queries that integrate Vector Search and GraphRAG.
Some Perspectives on HybridRAG in an ArangoDB World
Estimated reading time: 7 minutes
Introduction
Graph databases continue to gain momentum, thanks to their knack for handling intricate relationships and context. Developers and tech leaders are seeing the potential of pairing them with the creative strength of large language models (LLMs). This combination is opening the door to more precise, context-aware answers to natural language prompts. That’s where RAG comes in—it pulls in useful information, whether from raw text (VectorRAG) or a structured knowledge graph (GraphRAG), and feeds it into the LLM. The result? Smarter, more relevant responses that are grounded in actual data.
ArangoDB 3.12 – Performance for all Your Data Models
Estimated reading time: 6 minutes
We are proud to announce the GA release of ArangoDB 3.12!
Congrats to the team and community for the latest ArangoDB release 3.12! ArangoDB 3.12 is focused on greatly improving performance and observability both for the core database and our search offering. In this blog post, we will go through some of the most important changes to ArangoDB and give you an idea of how this can be utilized in your products.
The world is a graph: How Fix reimagines cloud security using a graph in ArangoDB
'Guest Blog'
Estimated reading time: 5 minutes
In 2015, John Lambert, a Corporate Vice President and Security Fellow at Microsoft, wrote: “Defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.”
The original problem in cloud security is visibility into my assets. If security engineers don’t know what cloud services are running, they can’t protect an environment. Unfortunately, first-generation cloud security products were built with a list mindset, i.e. “rows and columns”. They generate a list of assets and their configurations, but show no context of the relationships between connected cloud services, such as a connection that would allow lateral movement between two disparate cloud assets.
Cloud security as a graph
A graph database like ArangoDB provides a powerful way to represent and analyze complex relationships in cloud security.
A graph is the easiest way to understand how one entity in my cloud interacts with another. By representing cloud assets as nodes in a graph and the relationships between them as edges, I can now gain a better understanding of the nested connections in my cloud infrastructure.
By thinking about cloud resources in terms of ancestors and descendants, a cloud security engineer can solve problems in a way a table canʼt. The graph is an easier way to visualize the relationships between users and any of my cloud resources such as compute instances, functions, storage buckets and databases.
- Ancestors: The graph helps me understand the root of a security issue. What is the highest ancestor where an issue was introduced? Because I need to go all the way up and fix the problem at its origin.
- Descendants: The other way around is understanding descendants and blast radius. If I have an Internet-exposed compute instance, where an attacker is maybe able to get credentials off that instance, how many hops can that attacker go in? How much of my infrastructure is exposed due to this initial compromise?
In a cloud-native world, these graph traversal capabilities are fundamental for cloud security. Going forward, any operating model for cloud security should be built on a graph. With Fix, weʼre building such a modern cloud security tool, and weʼre building it with ArangoDB.
But first, a list!
Now that we covered the benefits of using a graph for cloud security, letʼs start with a list. Yes, a list – because sometimes, viewing my cloud assets in a graph might not be the most intuitive or useful thing.
For example, I may just want a list of my compute instance inventory across my AWS accounts. As a cloud security engineer, I want a baseline inventory of resources. I don’t really need a picture for that, I just want the list. And maybe I want to download it in a spreadsheet so I can slice and dice it, with metadata for each particular instance like create date, number of vCPUs and memory. A list is the best way to represent that information.
But if a list is enough, why collect data in a graph in the first place?
Because transforming a graph into a table is trivial; the other way around, not so much. The graph lets you express things that, with the same data in flat tables, would become intractable: many different tables, foreign-key relationships, and all kinds of joins all over the place. It just becomes too difficult to reason about.
The hard part is collecting data from cloud APIs and putting it into a graph form. Thatʼs much harder, takes time and is easy to get wrong. There are enough opportunities to make mistakes along the way, and create a representation thatʼs not correct or has bugs. Thatʼs why we believe transparency in how a cloud security product collects data matters. Both ArangoDB and Fix are open source. Our code shows how we collect and store data from cloud APIs in ArangoDB.
Graph-based analysis of cloud resources
The analysis layer of a graph is powerful because it can provide insights that tables cannot. One recent trend in security is that software engineers also take on security engineering tasks. They look after the security of their infrastructure, beyond infrastructure-as-code templates.
While Fix offers out-of-the-box visualizations and pre-built checks of compliance rules, we’ve also built a search syntax on top of the ArangoDB Query Language (AQL). With ArangoDB and AQL, I can store and query rich nested JSON-like documents together with their edges. It’s also easier to attach and query metadata on the edges, such as configuration data for a cloud resource. By building our syntax on top of AQL, we’ve made Fix human-friendly. Developers can easily run ad-hoc checks of the security posture of their infrastructure.
For example, activating flow logs in your VPCs is considered a security best practice by AWS. The search below finds all AWS VPCs where flow logs are deactivated.
is(aws_vpc) with(empty, --> is(aws_ec2_flow_log))
Breaking it down, the search:
- first, finds all resources of the kind “aws_vpc”, no matter in which account or region they run.
- then, filters for the VPCs without a direct relationship (successor) to an “aws_ec2_flow_log” resource.
A simple one line statement.
The same query expressed in SQL would require joining different tables with nested select statements, multiple where-clauses and case statements. It would be dozens of lines long and require an engineer to have knowledge of the table architecture and column names.
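For illustration, the same check could be expressed directly as a short AQL graph traversal. The collection names below ("resources" and "edges") are hypothetical placeholders; the actual Fix schema may differ:

```aql
// Hypothetical AQL sketch: find VPCs with no outgoing edge to a flow log.
// "resources" and "edges" are placeholder collection names.
FOR vpc IN resources
  FILTER vpc.kind == "aws_vpc"
  LET logs = (
    FOR v IN 1..1 OUTBOUND vpc edges
      FILTER v.kind == "aws_ec2_flow_log"
      RETURN v
  )
  FILTER LENGTH(logs) == 0
  RETURN vpc
```

Even in raw AQL, the traversal stays a single readable statement, which is what the Fix search syntax builds on.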
The power of a graph is that it lets you explore many-to-many relationships in a very easy way, in a way that a traditional row-based database just canʼt. By making security data from cloud resources available in a graph, software engineers with security responsibilities can gain visibility into the environment and reduce risks.
A graph provides context, context is king
The partnership between Fix and the ArangoDB team has brought our customers new security insights only made possible by the multi-dimensional relations of cloud resources stored in a graph. With ArangoDB, using graphs is no longer a complex computer science and operational challenge. For Fix, ArangoDB provides a graph database as a building block that makes it easy to store and query the relationships in your data.
Fix uses ArangoDB to analyze billions of relationships – in every cloud. With ArangoDB, weʼve been able to build a system that can ingest data at scale. One of our retail users ingests data from tens of thousands of cloud accounts in minutes, and then runs any type of analytics in a fraction of a second. The context of the graph helps security engineers to precisely answer questions and identify, prioritize and remediate risks – the “trifectaˮ of cloud security.
The precision, speed, and explainability of finding risks to your business is simply not possible without using a graph. When defenders can think in graphs, attackers lose.
Reintroducing the ArangoDB-RDF Adapter
Introducing ArangoDB’s Data Loader : Revolutionizing Your Data Migration Experience
Estimated reading time: 7 minutes
At ArangoDB, our commitment to empowering companies, developers, and data enthusiasts with cutting-edge tools and resources remains unwavering. Today, we’re thrilled to unveil our latest innovation, the Data Loader, a game-changing feature designed to simplify and streamline the migration of relational databases to ArangoGraph. Let’s dive into what makes Data Loader a must-have tool for your data migration needs.
Introducing the ArangoDB-PyG Adapter
Estimated reading time: 10 minutes
We are proud to announce the GA 1.0 release of the ArangoDB-PyG Adapter!
The ArangoDB-PyG Adapter exports Graphs from ArangoDB, the multi-model database for graph & beyond, into PyTorch Geometric (PyG), a PyTorch-based Graph Neural Network library, and vice-versa.
On July 29 2022, we introduced the first release of the PyTorch Geometric Adapter to the ArangoML community. We are proud to have PyG as the fourth member of our ArangoDB Adapter Family. You can expect the same developer-friendly adapter options and a helpful getting-started guide via Jupyter Notebook, and stay tuned for an upcoming Lunch & Learn session!
This blog post will serve as a walkthrough of the ArangoDB-PyG Adapter, via its official Jupyter Notebook.
Integrate ArangoDB with PyTorch Geometric to Build Recommendation Systems
Estimated reading time: 20 minutes
In this blog post, we will build a complete movie recommendation application using ArangoDB and PyTorch Geometric. We will tackle the challenge of building a movie recommendation application by transforming it into the task of link prediction. Our goal is to predict missing links between a user and the movies they have not watched yet.
ArangoSync: A Recipe for Reliability
Estimated reading time: 18 minutes
A detailed journey into deploying a DC2DC replicated environment
When we thought about all the things we wanted to share with our users, there were obviously a lot of topics to choose from. Our Enterprise feature ArangoSync was one of the topics we have talked about frequently, and we have also seen that our customers are keen to implement it in their environments, mostly because of the security requirement of having an ArangoDB cluster and all of its data located in multiple locations in case of a severe outage.
This blog post will help you set up and run an ArangoDB DC2DC environment and will guide you through all the necessary steps. By following the steps described you’ll be sure to end up with a production grade deployment of two ArangoDB clusters communicating with each other with datacenter to datacenter replication.
All of the best practices that we use during our day-to-day operations regarding encryption and secure authentication have been applied while writing this blog post, and every step in the setup will be explained in detail. There will be no need to doubt, research, or ponder which options to use and implement in any situation: your home lab, your production-grade database environment, and basically anywhere you want to run a deployment like this.
A note of importance however is that the ArangoSync feature including the used encryption at rest are Enterprise features that we don’t offer in our Community version of ArangoDB. If you don’t have an available Enterprise license for this project, you can download an evaluation version that has all functionality at: https://www.arangodb.com/download-arangodb-enterprise/
That's a lot of words as an introduction, but what actually is ArangoSync?
ArangoSync is our Enterprise feature that enables you to seamlessly and asynchronously replicate the entire structure and content in an ArangoDB cluster in one location to a cluster in another location. Imagine different cloud provider regions or different office locations in your company.
To run successfully, ArangoSync needs two fully functioning clusters and will not be useful when you’re only running a single instance of ArangoDB. So please also consider this when you’re making any plans to change or implement your database architecture.
In the above explanation I mentioned that ArangoSync works asynchronously. What this basically means is that when a client writes data into the source datacenter, the request is considered complete and finished before the data has been replicated to the other datacenter. The time needed to completely replicate changes to the other datacenter is typically in the order of seconds, but it can vary significantly depending on load, network, and compute capacity. Be mindful of what hardware you choose, so that it suits your use case and delivers the performance your environment needs.
ArangoSync performs replication in a single direction only. That means that you can replicate data from cluster A to cluster B or from cluster B to cluster A, but never to and from both at the same time.
ArangoSync runs a completely autonomous distributed system of synchronisation workers. Once configured properly via this blog post or any related documentation, it is designed to run continuously without manual intervention from anyone.
This of course doesn’t mean that it doesn’t require any maintenance at all and as with any distributed system some attention is needed to monitor its operation and keep it secure (Think of certificate & password rotation just to name two examples).
Once configured, ArangoSync will replicate both the structure and data of an entire cluster. This means that there is no need to make additional configuration changes when adding/removing databases or collections. Any data or metadata in the cluster will be automatically replicated.
When to use it… and when not to use it
ArangoSync is a good solution in all cases where you want to replicate data from one cluster to another without the requirement that the data is available immediately in the other cluster.
If you’re still doubting whether ArangoSync is the option for you then review the following list of no’s and if they apply to you or your organization.
- You want to use bidirectional replication data from cluster A to cluster B and vice versa.
- You need synchronous replication between 2 clusters.
- There is no network connection between cluster A and B.
- You want complete control over which database, collection & documents are replicated and which ones will not be.
Okay, I'm done reading the official part; now let's get started!
To start off, the first ArangoDB cluster will need at least 3 nodes. In this blog post we're using 6 nodes across the two datacenters, meaning 3 nodes per datacenter.
As an example, we're using the hypothetical locations dc1 and dc2, which can be located anywhere in the world, or simply live as multiple VMs in your test environment:
sync-dc1-node01
sync-dc1-node02
sync-dc1-node03
sync-dc2-node01
sync-dc2-node02
sync-dc2-node03
The three nodes with “dc1” are located in the first datacenter, and the three nodes with “dc2” in the second datacenter. For testing, the location can of course be any local environment that supports running six nodes at once with sufficient resources.
In this blog post, we picked Ubuntu Linux as the OS but as we’re using the .tar.gz
distribution with static executables, this means that you can choose whatever Linux distribution your organization runs and that you’re comfortable with. To control the ArangoDB installation, we use systemd, so the distribution should support systemd or you have to change things for automatic restarts after a reboot.
Currently, the most recent release of ArangoDB is version 3.8.0, so all of our examples mentioning file names will use the arangodb3e-linux-3.8.0.tar.gz
archive. The following ports need to be open/accessible on each node:
- 8528 : starter
- 8529 : coordinator
- 8530 : dbserver
- 8531 : agent
- 8532 : syncmaster
- 8533 : syncworker
Obviously, the process name next to each port number is for illustration purposes, so you know which port belongs to which process.
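If the nodes run a host firewall, the ports above must be opened explicitly. A minimal sketch using ufw (an assumption based on the Ubuntu setup used here; adapt to your distribution's firewall tooling):

```shell
# Hypothetical sketch: open the ArangoDB/ArangoSync ports with ufw.
# In a real deployment, restrict the allowed source addresses as well.
for port in 8528 8529 8530 8531 8532 8533; do
  ufw allow "${port}/tcp"
done
```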
We will roll out the clusters as the root user, but this is of course not necessary. In fact, our own packages create an arangodb user during installation. It is considered good practice to keep file ownership separate per service, and this should be done in production environments. We could have used any normal user account, provided we have access to the nodes and their filesystem.
The only significant part where we need root access is to set up the systemd service. Another important prerequisite is that you have properly configured SSH access to all nodes.
Setting up ArangoDB clusters - A detailed overview
We will go through all of the next steps in detail. There are quite a few to follow, so grab a cup of coffee or tea and sit back to work on this. As you might notice, the second half of the steps is repetitive, because we're setting up two similar clusters; we did not make a mistake, and you haven't misread. The settings for the two clusters differ slightly from each other, which is why we separate the installation steps. All commands you need to follow are explained and written out in detail, and can even be copied and pasted for future reference when you'd like to automate the installation steps in your own environment.
Extract the downloaded binary in its target location:
Assuming the archive file arangodb3e-linux-3.8.0.tar.gz
is present on your local machine, we deploy it to each node in the first cluster with the following commands:
scp arangodb3e-linux-3.8.0.tar.gz root@sync-dc1-node01:/tmp
scp arangodb3e-linux-3.8.0.tar.gz root@sync-dc1-node02:/tmp
scp arangodb3e-linux-3.8.0.tar.gz root@sync-dc1-node03:/tmp
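The three scp invocations can of course be collapsed into a loop, which also scales to the second datacenter later (node names as defined above):

```shell
# Deploy the archive to all dc1 nodes in one loop.
for node in sync-dc1-node01 sync-dc1-node02 sync-dc1-node03; do
  scp arangodb3e-linux-3.8.0.tar.gz "root@${node}:/tmp"
done
```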
To install ArangoDB, we run the following command on all cluster nodes:
mkdir -p /arangodb/data
cd /arangodb
tar xzvf /tmp/arangodb3e-linux-3.8.0.tar.gz
export PATH=/arangodb/arangodb3e-linux-3.8.0/bin:$PATH
A quick check to test the installation for functionality:
cd /arangodb
mkdir data
cd data
arangodb --starter.mode=single
This will launch a single server on each machine on port 8529, without any authentication or encryption. You can point your browser to the nodes on port 8529 to verify that the firewall settings are correct. If this does not work and you cannot reach the UI of the database, you should stop here and debug your firewall; otherwise, you are bound to run into more difficult trouble later on, for example because the processes in your cluster cannot reach each other over the network.
Afterward, simply press Control-C and run the following to clean up:
cd /arangodb/data
rm -rf *
Having tested basic functionality, let's get to the actual deployment of the cluster.
Create a shared secret for the first cluster
The different processes in the ArangoDB cluster must authenticate themselves against each other. To this end, we require a shared secret, which is deployed to each cluster machine. Here is a simple way to create such a secret on your laptop and to deploy it to each of the cluster nodes:
arangodb create jwt-secret
scp secret.jwt root@sync-dc1-node01:/arangodb/data
scp secret.jwt root@sync-dc1-node02:/arangodb/data
scp secret.jwt root@sync-dc1-node03:/arangodb/data
Note that we are using the arangodb executable from the distribution to create a secret file secret.jwt. For this to work, you have to install ArangoDB on your laptop, too. If you want to avoid this, you can simply create all the secrets and keys on one of your cluster nodes and use arangodb there.
Please keep the file secret.jwt in a safe place; possession of the file grants unrestricted superuser access to the database.
Create a CA and server keys, then deploy them:
All communications to the database as well as all communications within an ArangoDB cluster need to be encrypted via TLS. To this end, every process needs to have a pair of a private key and a corresponding public key (aka server certificate). During the steps of this blog post, we will create a self-signed CA key pair (the public CA key is signed by its own private key) and use that as the root certificate.
We use the following commands to create the CA keys tls-ca.key
(private) and tls-ca.crt
(public) as well as the server key and certificate in the keyfile files. A keyfile contains the private server key as well as the full chain of public certificates. Note how we add the server names into the server certificates:
arangodb create tls ca
arangodb create tls keyfile --host localhost --host 127.0.0.1 --host sync-dc1-node01 --host sync-dc1-node02 --host sync-dc1-node03 --keyfile sync-dc1-nodes.keyfile
scp sync-dc1-nodes.keyfile root@sync-dc1-node01:/arangodb/data
scp sync-dc1-nodes.keyfile root@sync-dc1-node02:/arangodb/data
scp sync-dc1-nodes.keyfile root@sync-dc1-node03:/arangodb/data
These commands are all executed on your local machine and deploy the server key to the cluster nodes. Keep the tls-ca.key
file secure; it can be used to sign further certificates. In particular, do not deploy it to your cluster! Furthermore, keep the sync-dc1-nodes.keyfile
file secure, since possession of it allows an attacker to listen in on the communication with your servers.
Create an encryption key for encryption at rest:
ArangoDB can keep all the data on disk encrypted using the AES-256 encryption standard. This is a requirement for most secure database installations. To this end, we need a 32 byte key for the encryption. It can basically consist of random bytes. Use these commands to set up an encryption key:
dd if=/dev/random of=sync-dc1-nodes.encryption bs=1 count=32
chmod 600 sync-dc1-nodes.encryption
scp sync-dc1-nodes.encryption root@sync-dc1-node01:/arangodb/data
scp sync-dc1-nodes.encryption root@sync-dc1-node02:/arangodb/data
scp sync-dc1-nodes.encryption root@sync-dc1-node03:/arangodb/data
Keep the encryption key secret: anyone who possesses it and can get hold of the database files in the filesystem can read the data at rest.
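Since an AES-256 key must be exactly 32 bytes, it is worth verifying the file size before deploying it. The following helper is a sketch of such a sanity check (nothing ArangoDB-specific, just plain shell):

```shell
# Sanity check: an AES-256 encryption-at-rest key must be exactly 32 bytes.
check_key_size() {
  size=$(wc -c < "$1")
  if [ "$size" -eq 32 ]; then
    echo "key size OK"
  else
    echo "unexpected key size: $size bytes" >&2
    return 1
  fi
}
# Usage: check_key_size sync-dc1-nodes.encryption
```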
Create a shared secret to use with ArangoSync:
The data center to data center replication in ArangoDB is implemented as a set of external processes. This allows for scalability and minimal impact on the actual database operations. The executable for the ArangoSync system is called arangosync and is packaged with our Enterprise Edition. Similar to the above steps, we need to create a shared secret for the different ArangoSync processes, such that they can authenticate themselves against each other.
We produce the shared secret in a way that is very similar to the one for the actual ArangoDB cluster:
arangodb create jwt-secret --secret syncsecret.jwt
scp syncsecret.jwt root@sync-dc1-node01:/arangodb/data
scp syncsecret.jwt root@sync-dc1-node02:/arangodb/data
scp syncsecret.jwt root@sync-dc1-node03:/arangodb/data
Keep the file syncsecret.jwt a secret, since its possession allows you to interfere with the ArangoSync system.
Create a TLS encryption setup for ArangoSync:
The same arguments about encrypted traffic and man-in-the-middle attacks apply to the ArangoSync system as explained above for the actual ArangoDB cluster. We choose to reuse the same CA key pair as above for the TLS certificate and key setup. Again, we use a single server keyfile for all three nodes.
Here we create the server keyfile, signed by the same CA key pair in tls-ca.key
and tls-ca.crt
:
arangodb create tls keyfile --host localhost --host 127.0.0.1 --host sync-dc1-node01 --host sync-dc1-node02 --host sync-dc1-node03 --keyfile synctls.keyfile
scp synctls.keyfile root@sync-dc1-node01:/arangodb/data
scp synctls.keyfile root@sync-dc1-node02:/arangodb/data
scp synctls.keyfile root@sync-dc1-node03:/arangodb/data
As usual, keep the file synctls.keyfile secure, since its possession allows one to listen in on the encrypted traffic with the ArangoSync system.
Set up client authentication setup for ArangoSync:
There is one more secret to set up before we can hit the launch button. The two ArangoSync systems in the two data centers need to authenticate each other. Actually, the second data center (“DC B”, the replica) needs to authenticate itself with the first data center (“DC A”, the original). Since this is cross data center traffic, the authentication is done via TLS client certificates.
We create and deploy the necessary files with the following commands on your local machine:
arangodb create client-auth ca
arangodb create client-auth keyfile
scp client-auth-ca.crt root@sync-dc1-node01:/arangodb/data
scp client-auth-ca.crt root@sync-dc1-node02:/arangodb/data
scp client-auth-ca.crt root@sync-dc1-node03:/arangodb/data
Keep the file client-auth-ca.key
secret, since it allows signing additional client authentication certificates. Do not store this on any of the cluster nodes.
Also, keep the file client-auth.keyfile
secret, since it allows authentication with a syncmaster in either data center.
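As a sanity check before deployment, you can verify that the generated client certificate chains back to the client-auth CA. This sketch assumes openssl is available and that openssl verify picks up the first CERTIFICATE block in the keyfile; it is not part of the official workflow:

```shell
# Sketch: confirm the client certificate was signed by client-auth-ca.crt.
# The guard avoids an error if the keyfile is not in the current directory.
if [ -f client-auth.keyfile ]; then
  openssl verify -CAfile client-auth-ca.crt client-auth.keyfile
fi
```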
The first cluster can now be launched:
We launch the whole system by means of the ArangoDB starter, which is included in the ArangoDB distribution. We launch the starter via a systemd service file, which basically looks like the following snippet; feel free to adapt it to your needs:
[Unit]
Description=Run the ArangoDB Starter
After=network.target
[Service]
# system limits
LimitNOFILE=131072
LimitNPROC=131072
TasksMax=131072
Restart=on-failure
KillMode=process
Environment=SERVER=sync-dc1-node01
ExecStart=/arangodb/arangodb3e-linux-3.8.0/bin/arangodb \
--starter.data-dir=/arangodb/data \
--starter.address=${SERVER} \
--starter.join=sync-dc1-node01,sync-dc1-node02,sync-dc1-node03 \
--auth.jwt-secret=/arangodb/data/secret.jwt \
--ssl.keyfile=/arangodb/data/sync-dc1-nodes.keyfile \
--rocksdb.encryption-keyfile=/arangodb/data/sync-dc1-nodes.encryption \
--starter.sync=true \
--sync.start-master=true \
--sync.start-worker=true \
--sync.master.jwt-secret=/arangodb/data/syncsecret.jwt \
--sync.server.keyfile=/arangodb/data/synctls.keyfile \
--sync.server.client-cafile=/arangodb/data/client-auth-ca.crt
TimeoutStopSec=60
[Install]
WantedBy=multi-user.target
Apart from some infrastructural settings like the number of file descriptors and restart policies, the service file basically runs a single command. It refers to the starter program arangodb
, which needs a few options to find all the secret files we have set up. These should be self-explanatory from what we have written above.
The network fabric of the cluster basically comes together since every instance of the starter gets told its own address (with the --starter.address
option), as well as a list of all the participating starters (with the --starter.join
option). We are achieving this by setting the actual server hostname in the line with Environment=SERVER=...
. Then we can refer to this environment variable with the syntax ${SERVER}
further down in the service file. This means that the above file has to be edited in just a single place for each individual machine, namely, you have to set the SERVER name.
Provided the above file has been saved under the name arango.service on your local machine, you can deploy the service with the following commands, again run locally:
scp arango.service root@sync-dc1-node01:/etc/systemd/system/arango.service
scp arango.service root@sync-dc1-node02:/etc/systemd/system/arango.service
scp arango.service root@sync-dc1-node03:/etc/systemd/system/arango.service
You then have to edit this file and adjust the server name, as described above. Then you launch the service with the following commands on each node in the cluster:
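Instead of editing each copy by hand, the single SERVER line can also be patched remotely. The following loop is only a sketch, assuming root SSH access to the nodes and GNU sed:

```shell
# Sketch: patch the Environment=SERVER=... line on each node instead of
# editing /etc/systemd/system/arango.service by hand.
patch_server_name() {
  for node in sync-dc1-node01 sync-dc1-node02 sync-dc1-node03; do
    ssh "root@${node}" \
      "sed -i 's/^Environment=SERVER=.*/Environment=SERVER=${node}/' /etc/systemd/system/arango.service"
  done
}
# Run once after deploying arango.service to all nodes:
# patch_server_name
```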
systemctl daemon-reload
systemctl start arango
You can check the status of the service with:
systemctl status arango
Or investigate the live log file by running:
journalctl -flu arango
Please note that all the data for the cluster resides in subdirectories of /arangodb/data. Every instance on each machine has a subdirectory there that contains its port in the directory name. You can find further log files in these subdirectories.
You should now be able to point your browser to port 8529 on any of the nodes. Before that, we recommend that you tell your browser to trust the tls-ca.crt certificate for server authentication. Since the public server keys are signed by the private CA key, your browser can then successfully prevent any man-in-the-middle attack.
An important step is now to change the root password, which will be empty in the beginning. You can use the UI for this.
We now set up the second cluster in exactly the same way. We simply show the commands used, since they differ only in details such as node names:
Extract the downloaded binary in its target location:
On your local machine run:
scp arangodb3e-linux-3.8.0.tar.gz root@sync-dc2-node01:/tmp
scp arangodb3e-linux-3.8.0.tar.gz root@sync-dc2-node02:/tmp
scp arangodb3e-linux-3.8.0.tar.gz root@sync-dc2-node03:/tmp
Then on each machine of the second cluster:
mkdir -p /arangodb/data
cd /arangodb
tar xzvf /tmp/arangodb3e-linux-3.8.0.tar.gz
export PATH=/arangodb/arangodb3e-linux-3.8.0/bin:$PATH
Create a shared secret for the second cluster
Perform these commands on your local machine:
arangodb create jwt-secret --secret secretdc2.jwt
scp secretdc2.jwt root@sync-dc2-node01:/arangodb/data
scp secretdc2.jwt root@sync-dc2-node02:/arangodb/data
scp secretdc2.jwt root@sync-dc2-node03:/arangodb/data
Warning: Keep the file secretdc2.jwt in a safe place: possession of the file grants unrestricted superuser access to the database.
Create a CA and server keys, then deploy them:
Note that we are using the same pair of CA keys for the TLS setup here as before during the preparation of the first cluster, so we rely on the files tls-ca.key and tls-ca.crt on your local machine. Perform these commands:
arangodb create tls keyfile --host localhost --host 127.0.0.1 --host sync-dc2-node01 --host sync-dc2-node02 --host sync-dc2-node03 --keyfile sync-dc2-nodes.keyfile
scp sync-dc2-nodes.keyfile root@sync-dc2-node01:/arangodb/data
scp sync-dc2-nodes.keyfile root@sync-dc2-node02:/arangodb/data
scp sync-dc2-nodes.keyfile root@sync-dc2-node03:/arangodb/data
Keep the file sync-dc2-nodes.keyfile
secure, since possession of it allows one to listen in to the communication with your servers.
Create an encryption key for encryption at rest
This is totally parallel to what we did for the first cluster. On your local machine run:
dd if=/dev/random of=sync-dc2-nodes.encryption bs=1 count=32
chmod 600 sync-dc2-nodes.encryption
scp sync-dc2-nodes.encryption root@sync-dc2-node01:/arangodb/data
scp sync-dc2-nodes.encryption root@sync-dc2-node02:/arangodb/data
scp sync-dc2-nodes.encryption root@sync-dc2-node03:/arangodb/data
Keep the encryption key secret: anyone who possesses it and can get hold of the database files in the filesystem can read the data at rest.
Create a shared secret to use with ArangoSync:
Run the following on your local machine:
arangodb create jwt-secret --secret syncsecretdc2.jwt
scp syncsecretdc2.jwt root@sync-dc2-node01:/arangodb/data
scp syncsecretdc2.jwt root@sync-dc2-node02:/arangodb/data
scp syncsecretdc2.jwt root@sync-dc2-node03:/arangodb/data
Keep the file syncsecretdc2.jwt secret, since its possession allows one to interfere with the ArangoSync system.
Create a TLS encryption setup for ArangoSync:
Again, we proceed exactly as for the first cluster. Do this on your local machine:
arangodb create tls keyfile --host localhost --host 127.0.0.1 --host sync-dc2-node01 --host sync-dc2-node02 --host sync-dc2-node03 --keyfile synctlsdc2.keyfile
scp synctlsdc2.keyfile root@sync-dc2-node01:/arangodb/data
scp synctlsdc2.keyfile root@sync-dc2-node02:/arangodb/data
scp synctlsdc2.keyfile root@sync-dc2-node03:/arangodb/data
As usual, keep the file synctlsdc2.keyfile secure, since its possession allows one to listen in on the encrypted traffic with the ArangoSync system.
Set up client authentication setup for ArangoSync:
Note that for simplicity, we use the same client authority CA for DC B as we did for DC A. This is not necessary but avoids a bit of confusion. On your local machine run:
arangodb create client-auth keyfile --host localhost --host 127.0.0.1 --host sync-dc2-node01 --host sync-dc2-node02 --host sync-dc2-node03 --keyfile client-auth-dc2.keyfile
scp client-auth-ca.crt root@sync-dc2-node01:/arangodb/data
scp client-auth-ca.crt root@sync-dc2-node02:/arangodb/data
scp client-auth-ca.crt root@sync-dc2-node03:/arangodb/data
Make sure to keep the files client-auth-ca.key and client-auth-dc2.keyfile stored securely outside of the cluster.
Now, we’re ready to launch the second cluster:
We use a service file very similar to the one for the first cluster:
[Unit]
Description=Run the ArangoDB Starter
After=network.target
[Service]
# system limits
LimitNOFILE=131072
LimitNPROC=131072
TasksMax=131072
Restart=on-failure
KillMode=process
Environment=SERVER=sync-dc2-node01
ExecStart=/arangodb/arangodb3e-linux-3.8.0/bin/arangodb \
--starter.data-dir=/arangodb/data \
--starter.address=${SERVER} \
--starter.join=sync-dc2-node01,sync-dc2-node02,sync-dc2-node03 \
--auth.jwt-secret=/arangodb/data/secretdc2.jwt \
--ssl.keyfile=/arangodb/data/sync-dc2-nodes.keyfile \
--rocksdb.encryption-keyfile=/arangodb/data/sync-dc2-nodes.encryption \
--starter.sync=true \
--sync.start-master=true \
--sync.start-worker=true \
--sync.master.jwt-secret=/arangodb/data/syncsecretdc2.jwt \
--sync.server.keyfile=/arangodb/data/synctlsdc2.keyfile \
--sync.server.client-cafile=/arangodb/data/client-auth-ca.crt
TimeoutStopSec=60
[Install]
WantedBy=multi-user.target
Provided the above file is named arango-dc2.service on your local machine, then you can deploy the service with the following commands on your local machine:
scp arango-dc2.service root@sync-dc2-node01:/etc/systemd/system/arango.service
scp arango-dc2.service root@sync-dc2-node02:/etc/systemd/system/arango.service
scp arango-dc2.service root@sync-dc2-node03:/etc/systemd/system/arango.service
Next, edit this file on each node and adjust the server name, as described above. Then launch the service with these commands on each machine in the cluster:
systemctl daemon-reload
systemctl start arango
You can query the status of the service with:
systemctl status arango
Or investigate the live log file by running:
journalctl -flu arango
Note that all the data for the cluster resides in subdirectories of /arangodb/data. Every instance on each machine has a subdirectory there which contains its port in the directory name. You can find further log files in these subdirectories.
You should now be able to point your browser to port 8529 on any of the nodes and connect to the ArangoDB UI.
Don’t forget to change the root password for your installation. This can be easily done via the ArangoDB UI.
Enable ArangoSync synchronization and start it by using the CLI:
ArangoSync is controlled via its CLI, which is installed together with ArangoDB, so there is no need to download and install it separately.
To configure DC to DC synchronization from DC A to DC B, you now have to run this command on your local machine:
arangosync configure sync --master.endpoint=https://sync-dc2-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password> --source.endpoint=https://sync-dc1-node01:8532 --source.cacert=tls-ca.crt --master.keyfile=client-auth.keyfile
If you want (or need) to check if replication is running, you can run the following two commands:
arangosync get status -v --master.endpoint=https://sync-dc1-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password>
arangosync get status -v --master.endpoint=https://sync-dc2-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password>
Get detailed information on running synchronization tasks:
arangosync get tasks -v --master.endpoint=https://sync-dc1-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password>
arangosync get tasks -v --master.endpoint=https://sync-dc2-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password>
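If you script against these status commands, for example in a deployment pipeline, a small polling loop can wait until replication is reported. The following is only a sketch: STATUS_CMD is a placeholder for your real arangosync get status invocation, and grepping for "running" is an assumption about the textual output format, so adapt it to what your arangosync version actually prints:

```shell
# Sketch: poll the sync status until it looks healthy. STATUS_CMD is a
# placeholder; point it at your real `arangosync get status` command.
STATUS_CMD=${STATUS_CMD:-"arangosync get status -v --master.endpoint=https://sync-dc2-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password>"}
wait_for_sync() {
  for attempt in $(seq 1 60); do   # try for up to ~10 minutes
    if $STATUS_CMD 2>/dev/null | grep -q "running"; then
      echo "replication running"
      return 0
    fi
    sleep 10
  done
  echo "timed out waiting for replication" >&2
  return 1
}
# Usage: wait_for_sync
```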
Temporarily stop the synchronization process:
arangosync stop sync --master.endpoint=https://sync-dc2-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password> --ensure-in-sync=true
This will briefly stop writes to DC A until both clusters are in perfect sync. It will then stop synchronization and switch DC B back to read/write mode. You can use the switch --ensure-in-sync=false
if you do not want to wait for synchronization to be ensured.
Abort synchronization:
If the two locations lose connectivity because of a network or other outage, you need to stop the synchronization entirely with the following command:
arangosync abort sync --master.endpoint=https://sync-dc2-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password>
This is the command to abort the synchronization on the target side (DC B). If there is no connectivity between the clusters, then this will naturally not abort the outgoing synchronization in DC A. Therefore, it is possible that you have to additionally send a corresponding command to the syncmaster in DC A, too.
It has a slightly different syntax:
arangosync abort outgoing sync --master.endpoint=https://sync-dc1-node01:8532 --master.cacert=tls-ca.crt --auth.user=root --auth.password=<password> --target <id>
The <id>
will need to be replaced by the ID of the cluster in DC B. You can retrieve this ID via the arangosync get status
output of DC A. There you can also tell whether this step is necessary: if you see an outgoing synchronization that has no corresponding incoming synchronization in the other DC, the arangosync abort outgoing sync command is needed.
Restart synchronization in the opposite direction:
arangosync configure sync --master.endpoint=https://sync-dc1-node01:8532 --master.cacert=tls-ca.crt --source.cacert=tls-ca.crt --master.keyfile=client-auth.keyfile --auth.user=root --auth.password=<password> --source.endpoint=https://sync-dc2-node01:8532
More details on the ArangoSync CLI and its options can be found in our documentation at:
https://www.arangodb.com/docs/stable/administration-dc2-dc.html
Concluding words
As I wrote at the very beginning of this blog post, we thought long and hard about an ideal topic, and we are very excited that so many of our users want to use ArangoSync, either for testing or in production. We truly hope that this is a great start for those looking into a rock-solid replicated database environment, and we wish you serious heaps of fun rolling it out and checking out the benefits!