Benchmark Results – ArangoDB vs. Neo4j: ArangoDB up to 8x faster than Neo4j
Introduction
This document presents benchmark results comparing ArangoDB’s Graph Analytics Engine (GAE) with Neo4j. The GAE is one component of ArangoDB’s Data Science Suite.
This reproducible benchmark aims to provide a neutral and thorough comparison between the two databases, ensuring a fair and unbiased assessment.
We use the wiki-Talk dataset, a widely used, real-world graph dataset derived from the edit and discussion history of Wikipedia.
The wiki-Talk dataset captures communication patterns between Wikipedia users, specifically interactions on user talk pages. It is used frequently in benchmarking graph databases and graph analytics systems. The characteristics that make wiki-Talk a useful benchmarking dataset include its directed structure, its scale (about 2.4 million nodes and 5 million edges), its temporal dimension, and its sparsity.
The results demonstrate the efficiency and scalability of each database, and offer a representative benchmark model for organizations evaluating graph databases for their needs.
Benchmark Highlights
The benchmark results reveal several notable insights, particularly highlighting ArangoDB's superior performance in graph analytics tasks compared to Neo4j. Most strikingly:
- ArangoDB consistently outperformed Neo4j across the graph computation algorithms tested, with speedups on the wiki-Talk dataset ranging from 1.7 times to over 8 times.
- This speed advantage is also evident in graph loading times, where ArangoDB loaded the wiki-Talk dataset about 1.8 times faster than Neo4j (9.9 seconds vs. 18 seconds).
ArangoDB's optimized data storage and retrieval, combined with its advanced query execution and effective use of clustered deployments, also contributed significantly to its superior performance in these scenarios.
These findings underscore:
- ArangoDB's capability to handle larger-scale and faster real-time graph analytics applications.
- ArangoDB as a much more compelling choice for industries and organizations that require rapid data processing and analysis, such as real-time recommendation systems, social network analysis, fraud detection, and cyber security.
Benchmark Overview
Datasets (wiki-Talk)
We used the wiki-Talk dataset, a well-regarded dataset for evaluating graph database performance. The graph and its size are as follows:
Graph Used | Vertices | Edges |
---|---|---|
wiki-Talk | 2,394,385 | 5,021,410 |
Hardware
All tests were conducted on the same machine with the following specifications:
OS: Ubuntu 23.10 (64-bit)
Memory: 192 GB (4800 MHz)
CPU: Ryzen 9 7950X3D (16 Cores, 32 Threads)
Database Configuration
***Neo4j***
Version: 5.19.0 (Community Edition)
Deployment: On-Premise, Single Process
***ArangoDB***
Version: 3.12.0-NIGHTLY.20240305 (Community Edition)
Deployment: On-Premise, Single Process
Graph Analytics Engine (GAE)
Version: Latest
Deployment: On-Premise, Single Process (Rust-based, no multithreading)
Benchmark Configuration
Two workflows were used to measure performance:
Workflow A:
- Create the in-memory representation
- Execute each algorithm once
- Measure the whole process
Workflow B:
- Create the in-memory representation
- Measure graph creation time
- Execute each algorithm individually
- Measure computation time
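The difference between the two workflows can be sketched as a small timing harness. This is only an illustration in Python, not the benchmark's actual Vitest/tinybench code: `load_graph` and `run_algorithm` are hypothetical placeholders for the calls that would go to the database under test.

```python
import time
from typing import Callable, Dict, List

# Hypothetical stand-ins (assumptions): a real harness would call the
# database under test here, for example through a driver or HTTP request.
def load_graph() -> dict:
    return {"edges": [(0, 1), (1, 2), (2, 0)]}

def run_algorithm(graph: dict, name: str) -> int:
    return len(graph["edges"])  # placeholder work

ALGORITHMS: List[str] = ["pagerank", "wcc", "scc", "label_propagation"]

def timed(fn: Callable, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def workflow_a() -> float:
    """Workflow A: one measurement spanning graph creation and every algorithm."""
    def whole_process() -> None:
        graph = load_graph()
        for name in ALGORITHMS:
            run_algorithm(graph, name)
    _, elapsed = timed(whole_process)
    return elapsed

def workflow_b() -> Dict[str, float]:
    """Workflow B: graph creation and each algorithm are timed separately."""
    graph, load_time = timed(load_graph)
    timings = {"load": load_time}
    for name in ALGORITHMS:
        _, timings[name] = timed(run_algorithm, graph, name)
    return timings
```

Workflow A yields a single end-to-end number, while Workflow B yields one number for loading and one per algorithm, which is the shape of the result tables below.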
Algorithms Tested
- PageRank
- Weakly Connected Components (WCC)
- Strongly Connected Components (SCC)
- Label Propagation
Used Technologies
- JavaScript Framework: Vitest with tinybench
- Communication:
  - Neo4j: Official Neo4j JS driver ("neo4j-driver": "^5.18.0")
  - GAE: Plain HTTPS requests using Axios ("axios": "^1.6.8")
Benchmark Results
Graph Loading (wiki-Talk)
Task | GAE (sec) | Neo4j (sec) | Times Faster |
---|---|---|---|
Load graph wiki-Talk | 9.9 | 18 | 1.8 x |
Load Graph wiki-Talk with Attributes | 10.7 | 19.2 | 1.8 x |
Graph Computation (wiki-Talk)
Task | GAE (sec) | Neo4j (sec) | Times Faster |
---|---|---|---|
Compute PageRank | 3.8 | 10.6 | 2.8 x |
Compute WCC | 2.3 | 4.5 | 1.7 x |
Compute SCC | 3.2 | 6.7 | 2.1 x |
Compute Label Propagation | 1.5 | 13 | 8.5 x |
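For readers who want to check the "Times Faster" column, the ratio is presumably the Neo4j time divided by the GAE time. The sketch below recomputes it from the rounded times shown in the two tables. The recomputed values for WCC and Label Propagation differ from the published column; the source does not say how that column was derived, so the published figures may come from measurements not shown here.

```python
# Times in seconds, copied from the two result tables: (GAE, Neo4j).
RESULTS = {
    "Load graph wiki-Talk": (9.9, 18.0),
    "Load Graph wiki-Talk with Attributes": (10.7, 19.2),
    "Compute PageRank": (3.8, 10.6),
    "Compute WCC": (2.3, 4.5),
    "Compute SCC": (3.2, 6.7),
    "Compute Label Propagation": (1.5, 13.0),
}

def speedup(gae_seconds: float, neo4j_seconds: float) -> float:
    """How many times faster the GAE run was (assumed definition of the column)."""
    return neo4j_seconds / gae_seconds

SPEEDUPS = {
    task: round(speedup(gae, neo4j), 1)
    for task, (gae, neo4j) in RESULTS.items()
}

for task, factor in SPEEDUPS.items():
    print(f"{task}: {factor} x")
```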
Explanation of Elements
Graph Algorithms
- PageRank: an algorithm that ranks the nodes of a graph based on their connections; it is also commonly used in search engines.
- Weakly Connected Components (WCC): identifies subsets of a graph in which any two vertices are connected by a path when the direction of edges is ignored.
- Strongly Connected Components (SCC): identifies subsets of a graph in which every vertex is reachable from every other vertex in the same subset, following the direction of edges.
- Label Propagation: a semi-supervised learning algorithm for community detection in graphs, in which nodes iteratively propagate their labels to their neighbors.
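The difference between weak and strong connectivity is easiest to see on a tiny directed graph. The following is a minimal textbook sketch (union-find for WCC, Kosaraju's algorithm for SCC), not the implementation used by the GAE or Neo4j; the example graph is made up.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]

def weakly_connected_components(nodes: List[int], edges: List[Edge]) -> List[Set[int]]:
    """Group nodes with union-find, treating every edge as undirected."""
    parent = {n: n for n in nodes}

    def find(n: int) -> int:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for src, dst in edges:
        parent[find(src)] = find(dst)

    groups: Dict[int, Set[int]] = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())

def strongly_connected_components(nodes: List[int], edges: List[Edge]) -> List[Set[int]]:
    """Kosaraju's algorithm: DFS finish order, then DFS on the reversed graph."""
    forward, backward = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)

    def dfs(start, adjacency, visited, out):
        # Iterative DFS that appends nodes to `out` in post-order.
        visited.add(start)
        stack = [(start, iter(adjacency[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(adjacency[nxt])))
                    break
            else:
                out.append(node)
                stack.pop()

    visited, order = set(), []
    for n in nodes:
        if n not in visited:
            dfs(n, forward, visited, order)

    visited, components = set(), []
    for n in reversed(order):
        if n not in visited:
            component: List[int] = []
            dfs(n, backward, visited, component)
            components.append(set(component))
    return components

# Made-up example: a cycle 0 -> 1 -> 2 -> 0 plus a one-way edge 2 -> 3.
NODES = [0, 1, 2, 3]
EDGES = [(0, 1), (1, 2), (2, 0), (2, 3)]
WCC = weakly_connected_components(NODES, EDGES)
SCC = strongly_connected_components(NODES, EDGES)
```

Ignoring direction, all four nodes form one weak component. Respecting direction, node 3 cannot reach the cycle, so it forms its own strong component.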
Reasons for ArangoDB’s Superior Performance
Several factors contribute to ArangoDB's superior performance:
ArangoDB's performance on the wiki-Talk dataset is attributable to specific architectural optimizations rather than to raw computational speed alone. In this scenario, ArangoDB serves as the data storage system, while the computation is handled by the Graph Analytics Engine (GAE). The benchmark focuses on two key stages:
- Loading the data into the GAE
- Computation of algorithms within the GAE
Graph Loading Times
ArangoDB Side
ArangoDB’s graph loading times are optimized due to two primary factors:
- Parallel Data Extraction: ArangoDB supports parallel data loading from both single-server and distributed systems, which is a major reason for its data loading advantage. This capability lets teams scale to multiple machines, where increased parallelism yields faster data transfer. Efficient horizontal scaling gives significant performance improvements over approaches that are limited to sequential extraction.
- Projections for Targeted Data Transfer: Projections allow ArangoDB to transmit only the data attributes required for analysis. If only edge IDs and a single attribute are needed, the system extracts and transfers only those fields, avoiding the overhead of transmitting entire documents. This reduces both the data volume and network latency during graph loading operations.
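Both ideas can be illustrated with plain Python. The sketch below is not ArangoDB's extraction code: the shard layout, the field names, and the thread pool are assumptions chosen only to show why projecting fields shrinks the payload and how independent shards can be read concurrently.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List

# Hypothetical edge documents split across two shards; field names are illustrative.
SHARDS: List[List[Dict]] = [
    [{"_id": "e/1", "_from": "u/1", "_to": "u/2", "weight": 0.5, "comment": "x" * 200}],
    [{"_id": "e/2", "_from": "u/2", "_to": "u/3", "weight": 1.5, "comment": "y" * 200}],
]

def project(doc: Dict, fields: Iterable[str]) -> Dict:
    """Keep only the requested attributes of a document."""
    return {field: doc[field] for field in fields if field in doc}

def extract_shard(shard: List[Dict], fields: Iterable[str]) -> List[Dict]:
    return [project(doc, fields) for doc in shard]

def extract_parallel(shards: List[List[Dict]], fields: Iterable[str]) -> List[Dict]:
    """Extract every shard concurrently and concatenate results in shard order."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = pool.map(lambda shard: extract_shard(shard, fields), shards)
        return [doc for part in parts for doc in part]

def payload_bytes(docs: List[Dict]) -> int:
    """Size of the documents when serialized as JSON."""
    return len(json.dumps(docs).encode("utf-8"))

FIELDS = ("_id", "weight")
projected = extract_parallel(SHARDS, FIELDS)
full = [doc for shard in SHARDS for doc in shard]
print(payload_bytes(projected), "bytes projected vs.", payload_bytes(full), "bytes full")
```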
Graph Analytics Engine (GAE) Side
The GAE is built in Rust and processes the transferred data with high efficiency:
- Efficient Data Representation: The GAE stores graph data in highly optimized in-memory structures, reducing memory usage while maintaining fast access. Graphs are ready for computation as soon as loading completes.
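The source does not describe the GAE's internal layout. As an illustration of what a compact in-memory graph representation can look like, the sketch below builds a compressed sparse row (CSR) structure, a common choice for analytics engines; whether the GAE uses CSR is an assumption, not something stated here.

```python
from array import array
from typing import List, Tuple

def build_csr(num_vertices: int, edges: List[Tuple[int, int]]):
    """Return (offsets, targets) for a directed graph in CSR form.

    The out-neighbours of vertex v are targets[offsets[v]:offsets[v + 1]].
    """
    offsets = array("q", [0] * (num_vertices + 1))
    for src, _ in edges:
        offsets[src + 1] += 1             # count the out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]      # prefix sum turns counts into start offsets
    targets = array("q", [0] * len(edges))
    cursor = offsets[:num_vertices]       # copy: next free slot per vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbours(offsets, targets, v: int) -> List[int]:
    return list(targets[offsets[v]:offsets[v + 1]])

# Made-up example graph with three vertices.
OFFSETS, TARGETS = build_csr(3, [(0, 1), (0, 2), (2, 0), (1, 2)])
```

Two flat integer arrays replace per-vertex objects and pointers, which keeps memory use low and makes neighbour scans sequential.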
Advantages in the Workflow
These features deliver several tangible benefits, as shown during the benchmark:
- Fast and Parallel Data Extraction - Parallelism improves speed and scalability.
- Optimized Data Transfer with Projections - Only the required data is transmitted, minimizing overhead.
- Compact and Efficient In-Memory Representation in the GAE - High-performance graph computation with minimal memory footprint.
Clarifying the Benchmark Scope
It is important to note that the benchmark does not evaluate data insertion times into ArangoDB or computational tasks performed by ArangoDB itself. Instead, it assesses the efficiency of:
- Loading graph data from ArangoDB into the GAE.
- The GAE's ability to compute graph algorithms.
By highlighting these stages, the benchmark shows the advantages of ArangoDB’s design in supporting large-scale graph workflows through fast data loading and efficient interaction with the GAE.
Reproducibility of the Benchmark
This benchmark is 100% reproducible, ensuring consistent and verifiable results. These results reflect ArangoDB’s implementation per the precise specifications and configurations mentioned above. We welcome organizations to replicate the benchmark to ensure consistent results. To do this, follow these steps:
- First, set up the hardware environment with an Ubuntu 23.10 operating system, 192 GB of memory, and a Ryzen 9 7950X3D CPU.
- Install and configure Neo4j and ArangoDB in the versions listed above, using the provided Docker configurations. Use single-process (non-clustered) configurations for both.
- Next, utilize the wiki-Talk dataset for testing. Execute the specified graph algorithms (PageRank, WCC, SCC, Label Propagation) using the detailed workflows (A and B) outlined in the benchmark configuration above.
- Measure the in-memory graph creation and computation times, and compare the results for both databases. This method ensures that the benchmark can be reliably reproduced in different environments.
PLEASE NOTE: This benchmark requires the installation of the ArangoDB Graph Analytics Engine (GAE). As this code is not open source, please reach out to Corey Sommers at corey.sommers@arangodb.com to receive access to the GAE for the purposes of reproducing this benchmark in your environment (to ensure objectivity of results).
Conclusion
The benchmark results clearly demonstrate ArangoDB's far superior performance over Neo4j in the categories of graph computation and loading tasks. ArangoDB's significant speed advantages - particularly its ability to execute complex algorithms and load large datasets much faster - highlight its optimized architecture and efficient data handling.
These findings make ArangoDB a compelling choice for applications requiring high-performance graph analytics and real-time data processing.
Vector Search in ArangoDB: Practical Insights and Hands-On Examples
Estimated reading time: 5 minutes
Vector search is gaining traction as a go-to tool for handling large, unstructured datasets like text, images, and audio. It works by comparing vector embeddings (numerical representations generated by machine learning models) to find items with similar properties. With the integration of Facebook’s FAISS library, ArangoDB brings scalable, high-performance vector search directly into its core, accessible via AQL (ArangoDB Query Language). Vector Search is now just another fully integrated data type/model in ArangoDB’s multi-model approach. The Vector Search capability is currently in Developer Preview and will be in production release in Q1 2025.
This guide will walk you through setting up vector search, combining it with graph traversal for advanced use cases, and using tools like LangChain to power natural language queries that integrate Vector Search and GraphRAG.
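To make "comparing vector embeddings" concrete, here is a minimal brute-force sketch using cosine similarity. It is not how ArangoDB or FAISS performs the search (FAISS uses index structures to avoid scanning every vector), and the document keys and three-dimensional embeddings are invented for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest(query: Sequence[float],
            embeddings: Dict[str, List[float]],
            k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top k."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Invented toy embeddings; real ones come from a machine learning model.
EMBEDDINGS = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_tax": [0.0, 0.1, 0.9],
}

TOP = nearest([1.0, 0.0, 0.0], EMBEDDINGS, k=2)
```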
Some Perspectives on HybridRAG in an ArangoDB World
Estimated reading time: 7 minutes
Introduction
Graph databases continue to gain momentum, thanks to their knack for handling intricate relationships and context. Developers and tech leaders are seeing the potential of pairing them with the creative strength of large language models (LLMs). This combination is opening the door to more precise, context-aware answers to natural language prompts. That’s where RAG comes in—it pulls in useful information, whether from raw text (VectorRAG) or a structured knowledge graph (GraphRAG), and feeds it into the LLM. The result? Smarter, more relevant responses that are grounded in actual data.
ArangoDB 3.12 – Performance for all Your Data Models
Estimated reading time: 6 minutes
We are proud to announce the GA release of ArangoDB 3.12!
Congrats to the team and community for the latest ArangoDB release 3.12! ArangoDB 3.12 is focused on greatly improving performance and observability both for the core database and our search offering. In this blog post, we will go through some of the most important changes to ArangoDB and give you an idea of how this can be utilized in your products.
Advanced Fraud Detection in Financial Services with ArangoDB and AQL
Estimated reading time: 3 minutes
Advanced Fraud Detection: ArangoDB’s AQL vs. Traditional RDBMS
In the realm of financial services, where fraud detection is both critical and complex, the choice of database and query language can impact the efficiency and effectiveness of fraud detection systems. Let’s explore how ArangoDB – a multi-model graph database – is powered by AQL (ArangoDB Query Language) to handle multiple, real-world fraud detection scenarios in a much more seamless and powerful way compared to traditional Relational Database Management Systems (RDBMS).
Update: Evolving ArangoDB’s Licensing Model for a Sustainable Future
Estimated reading time: 7 minutes
Update: https://arangodb.com/2023/10/evolving-arangodbs-licensing-model-for-a-sustainable-future/
Last October the first iteration of this blog post explained an update to ArangoDB’s 10-year-old license model. Thank you for providing feedback and suggestions. As mentioned, we will always remain committed to our community and hence today, we are happy to announce yet another update that integrates your feedback.
The world is a graph: How Fix reimagines cloud security using a graph in ArangoDB
'Guest Blog'
Estimated reading time: 5 minutes
In 2015, John Lambert, a Corporate Vice President and Security Fellow at Microsoft, wrote: “Defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.”
The original problem in cloud security is visibility into my assets. If security engineers don’t know what cloud services are running, they can’t protect an environment. Unfortunately, first-generation cloud security products were built with a list mindset, i.e. “rows and columns”. They generate a list of assets and their configurations – but show no context of the relationships between connected cloud services, such as a connection that would allow lateral movement between two disparate cloud assets.
Cloud security as a graph
A graph database like ArangoDB provides a powerful way to represent and analyze complex relationships in cloud security.
A graph is the easiest way to understand how one entity in my cloud interacts with another. By representing cloud assets as nodes in a graph and the relationships between them as edges, I can gain a better understanding of the nested connections in my cloud infrastructure.
By thinking about cloud resources in terms of ancestors and descendants, a cloud security engineer can solve problems in a way a table can’t. The graph is an easier way to visualize the relationships between users and any of my cloud resources such as compute instances, functions, storage buckets and databases.
- Ancestors: The graph helps me understand the root of a security issue. What is the highest ancestor where an issue was introduced? Because I need to go all the way up and fix the problem at its origin.
- Descendants: The other way around is understanding descendants and blast radius. If I have an Internet-exposed compute instance, where an attacker is maybe able to get credentials off that instance, how many hops can that attacker go in? How much of my infrastructure is exposed due to this initial compromise?
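Both questions are graph traversals. The sketch below uses a breadth-first search over a small, invented resource graph: following edges forward gives descendants and their hop counts (the blast radius), and following them backward gives ancestors. The resource names and edges are hypothetical, and this is not Fix's implementation.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple

# Hypothetical resource graph: an edge points from a resource to one it leads to.
EDGES: List[Tuple[str, str]] = [
    ("account", "vpc"),
    ("vpc", "instance"),
    ("instance", "iam_role"),
    ("iam_role", "bucket"),
    ("iam_role", "database"),
]

def reachable(start: str,
              edges: List[Tuple[str, str]],
              max_hops: Optional[int] = None,
              reverse: bool = False) -> Dict[str, int]:
    """Breadth-first search returning {node: hop_count} for nodes reachable from start.

    reverse=False walks to descendants; reverse=True walks to ancestors.
    """
    adjacency = defaultdict(list)
    for src, dst in edges:
        if reverse:
            src, dst = dst, src
        adjacency[src].append(dst)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if max_hops is not None and hops[node] >= max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    del hops[start]
    return hops

blast_radius = reachable("instance", EDGES)
ancestors = reachable("instance", EDGES, reverse=True)
highest_ancestor = max(ancestors, key=ancestors.get)
```

Here a compromised `instance` exposes the role, the bucket, and the database within two hops, while the highest ancestor to fix at the origin is the `account`.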
In a cloud-native world, these graph traversal capabilities are fundamental for cloud security. Going forward, any operating model for cloud security should be built on a graph. With Fix, we’re building such a modern cloud security tool, and we’re building it with ArangoDB.
But first, a list!
Now that we have covered the benefits of using a graph for cloud security, let’s start with a list. Yes, a list – because sometimes, viewing my cloud assets in a graph might not be the most intuitive or useful thing.
For example, I may just want a list of my compute instance inventory across my AWS accounts. As a cloud security engineer, I want a baseline inventory of resources. I don’t really need a picture for that, I just want the list. And maybe I want to download it in a spreadsheet so I can slice and dice it, with metadata for each particular instance like create date, number of vCPUs and memory. A list is the best way to represent that information.
But if a list is enough, why collect data in a graph in the first place?
Because transformation from a graph to a table is trivial. The other way around, not so much. The graph lets you express things that, with the same data in flat tables, would become intractable: many different tables, foreign key relationships, and joins all over the place. It just becomes too difficult to reason about.
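As a small illustration of the graph-to-table direction, the sketch below flattens instance nodes and their owning accounts into spreadsheet-style rows. The node attributes, IDs, and edge layout are invented for the example and do not reflect Fix's data model.

```python
import csv
import io
from typing import Dict, List, Tuple

# Invented graph: nodes carry attributes, edges point from an account to a resource.
NODES: Dict[str, Dict] = {
    "i-1": {"kind": "instance", "vcpus": 4, "memory_gb": 16},
    "i-2": {"kind": "instance", "vcpus": 8, "memory_gb": 32},
    "b-1": {"kind": "bucket"},
}
EDGES: List[Tuple[str, str]] = [("acct-a", "i-1"), ("acct-b", "i-2"), ("acct-a", "b-1")]

COLUMNS = ["id", "account", "vcpus", "memory_gb"]

def instances_table(nodes: Dict[str, Dict], edges: List[Tuple[str, str]]) -> List[Dict]:
    """One row per instance node, with its owning account looked up via the edges."""
    owner = {dst: src for src, dst in edges}
    return [
        {"id": node_id, "account": owner.get(node_id, ""),
         "vcpus": attrs["vcpus"], "memory_gb": attrs["memory_gb"]}
        for node_id, attrs in sorted(nodes.items())
        if attrs["kind"] == "instance"
    ]

def to_csv(rows: List[Dict]) -> str:
    """Serialize rows as CSV text, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

CSV_TEXT = to_csv(instances_table(NODES, EDGES))
```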
The hard part is collecting data from cloud APIs and putting it into a graph form. That’s much harder, takes time and is easy to get wrong. There are enough opportunities to make mistakes along the way, and create a representation that’s not correct or has bugs. That’s why we believe transparency in how a cloud security product collects data matters. Both ArangoDB and Fix are open source. Our code shows how we collect and store data from cloud APIs in ArangoDB.
Graph-based analysis of cloud resources
The analysis layer of a graph is powerful because it can provide insights that tables cannot. One recent trend in security is that software engineers also take on security engineering tasks. They look after the security of their infrastructure, beyond infrastructure-as-code templates.
While Fix offers out-of-the-box visualizations and pre-built checks of compliance rules, we’ve also built a search syntax on top of the ArangoDB Query Language (AQL). With ArangoDB and AQL, I can store and query rich nested JSON-like documents together with their vertices. It’s also easier to add and query metadata to the vertices – such as configuration data for a cloud resource. By building our syntax on top of AQL, we’ve made Fix human-friendly. Developers can easily run ad-hoc checks of the security posture of their infrastructure.
For example, activating flow logs in your VPCs is considered a security best practice by AWS. The search below finds all AWS VPCs where flow logs are deactivated.
is(aws_vpc) with(empty, --> is(aws_ec2_flow_log))
Breaking it down, the search:
- first, finds all resources of the kind “aws_vpc”, no matter in which account or region they may run.
- then, filters for the VPCs without a direct relationship (successor) to an “aws_ec2_flow_log” resource.
A simple one-line statement.
The same query expressed in SQL would require joining different tables with nested select statements, multiple where-clauses and case statements. It would be dozens of lines long and require an engineer to have knowledge of the table architecture and column names.
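The logic of that search can be mirrored in a few lines of Python over an in-memory graph. This is only an approximation of the semantics described in the breakdown, not how Fix evaluates the search; the resource IDs and the `aws_ec2_subnet` kind are invented for the example.

```python
from typing import Dict, List, Tuple

# Invented sample data: resource id -> kind, and directed edges (source, successor).
RESOURCES: Dict[str, str] = {
    "vpc-1": "aws_vpc",
    "vpc-2": "aws_vpc",
    "fl-1": "aws_ec2_flow_log",
    "subnet-1": "aws_ec2_subnet",  # hypothetical kind name
}
EDGES: List[Tuple[str, str]] = [
    ("vpc-1", "fl-1"),
    ("vpc-1", "subnet-1"),
    ("vpc-2", "subnet-1"),
]

def without_successor_of_kind(resources: Dict[str, str],
                              edges: List[Tuple[str, str]],
                              kind: str,
                              successor_kind: str) -> List[str]:
    """Resources of `kind` that have no direct successor of `successor_kind`."""
    has_successor = {src for src, dst in edges if resources.get(dst) == successor_kind}
    return sorted(
        resource_id
        for resource_id, resource_kind in resources.items()
        if resource_kind == kind and resource_id not in has_successor
    )

VPCS_WITHOUT_FLOW_LOGS = without_successor_of_kind(
    RESOURCES, EDGES, "aws_vpc", "aws_ec2_flow_log"
)
```

In this sample, `vpc-1` has a flow log successor and is filtered out, leaving only `vpc-2`.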
The power of a graph is that it lets you explore many-to-many relationships in a very easy way, in a way that a traditional row-based database just can’t. By making security data from cloud resources available in a graph, software engineers with security responsibilities can gain visibility into the environment and reduce risks.
A graph provides context, context is king
The partnership between Fix and the ArangoDB team has brought our customers new security insights only made possible by the multi-dimensional relations of cloud resources stored in a graph. With ArangoDB, using graphs is no longer a complex computer science and operational challenge. For Fix, ArangoDB provides a graph database as a building block that makes it easy to store and query the relationships in your data.
Fix uses ArangoDB to analyze billions of relationships – in every cloud. With ArangoDB, we’ve been able to build a system that can ingest data at scale. One of our retail users ingests data from tens of thousands of cloud accounts in minutes, and then runs any type of analytics in a fraction of a second. The context of the graph helps security engineers to precisely answer questions and identify, prioritize and remediate risks – the “trifecta” of cloud security.
The precision, speed, and explainability of finding risks to your business is simply not possible without using a graph. When defenders can think in graphs, attackers lose.
Reintroducing the ArangoDB-RDF Adapter
Introducing ArangoDB’s Data Loader: Revolutionizing Your Data Migration Experience
Estimated reading time: 7 minutes
At ArangoDB, our commitment to empowering companies, developers, and data enthusiasts with cutting edge tools and resources remains unwavering. Today, we’re thrilled to unveil our latest innovation, the Data Loader, a game-changing feature designed to simplify and streamline the migration of relational databases to ArangoGraph. Let’s dive into what makes Data Loader a must-have tool for your data migration needs.