The Transformative Power of ArangoDB GraphRAG in Genomics-Driven Personalized Medicine

Estimated reading time: 7 minutes

Introduction

Personalized medicine is a disruptive innovation in healthcare. Medical treatment can now pivot from mass-market, standardized care models to custom-made, client-centric solutions. For example, healthcare providers can offer precision-targeted therapeutic products and services based on individual genetic data and lifestyle metrics. This shift could improve patient outcomes and open up new market segments and revenue streams for healthcare organizations: instead of selling commoditized therapies, they would offer customized, data-driven health management.

Personalized medicine challenges existing business models for stakeholders across the healthcare ecosystem. From pharmaceutical companies to insurers and care providers, personalized medicine becomes an opportunity for competitive differentiation.

In the long term, personalized medicine has the potential to redefine the entire healthcare value chain. New touchpoints for customer engagement emerge. Providers could cultivate a more proactive approach to health management, with therapies based on individual genetic profiles, environmental factors, and lifestyle choices.

At the intersection of cutting-edge genomics and advanced computational methods lies an opportunity to revolutionize patient care through technologies like GraphRAG (Graph Retrieval Augmented Generation). This white paper explores how ArangoDB's GraphRAG implementation offers unique capabilities for tackling complex challenges in personalized medicine, presenting high-impact applications with detailed analyses of challenges, solutions, and potential returns on investment. 

Knowledge graphs now converge with vector-based search methods. Healthcare providers can extract meaningful insights from complex, diverse, interconnected medical data. The goal is to deliver far more accurate diagnoses and efficacious treatments. Enhanced patient outcomes will emerge across a range of clinical scenarios.

The Promise and Challenges of Personalized Medicine

"The ability to sequence an entire human genome for less than the cost of a chest x-ray series has changed everything. We are entering an era where we will be able to provide truly personalized care based on an individual's genetic makeup. However, we are still in the infancy of understanding how to interpret and apply this vast amount of information."

                                         -Dr. Francis Collins, former director of the National Institutes of Health, key figure in the Human Genome Project. 

The Human Genome Project's completion in 2003 promised a new era of medicine where treatments would be precisely calibrated to an individual's genetic makeup. Yet, two decades later, we still struggle to realize this vision fully. Why? Because biological systems are fiendishly complex, and the tools to navigate this complexity have been, until recently, woefully inadequate.

The challenge isn't a lack of data—quite the opposite. Modern healthcare systems are drowning in information: electronic health records, genome sequencing data, biomarker measurements, clinical trial results, and a constant torrent of new research findings. What's missing is the ability to connect these disparate data points in meaningful ways, to extract insights from the noise, and to present these insights in a format that supports clinical decision-making. 

Traditional databases struggle with this task because they weren't designed to handle the inherently interconnected nature of biological and medical knowledge. Relational databases force complex relationships into rigid tables, while document stores lack the structure to navigate connections efficiently. Vector databases can capture semantic similarities but miss critical relationship context.

This is where graph databases—and specifically ArangoDB's GraphRAG technology—enter the picture. By combining the relationship-focused power of knowledge graphs with the semantic capabilities of vector embeddings, GraphRAG offers a powerful new approach to personalized medicine challenges. The integration of large language models (LLMs) with knowledge graphs creates a system that retrieves relevant information and generates contextually appropriate insights and recommendations.

GraphRAG: Where Vector Retrieval Meets Knowledge Graphs

Let's clarify what makes GraphRAG unique before diving into specific applications. Traditional Retrieval Augmented Generation (RAG) typically relies on vector embeddings to find content that is semantically similar. While effective for many applications, this approach treats documents as isolated units, missing the rich web of relationships between entities.

By contrast, GraphRAG structures information as interconnected nodes and edges in a knowledge graph. This allows for precise traversal of relationships—critical in medical contexts where understanding how entities relate to each other is often more important than finding similar text. 

When a doctor asks, "What treatments are effective for patients with this genetic variant?" they're not looking for semantically similar documents; they're asking for a specific traversal of the relationship between variants, conditions, and treatments.
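
To make the contrast with similarity search concrete, here is a minimal sketch of such a traversal over a toy in-memory graph. The node names, edge labels, and the `treatments_for_variant` helper are hypothetical and chosen only for illustration; a real deployment would run the equivalent traversal inside the database.

```python
from collections import defaultdict

# Toy knowledge graph as (source, edge_label, target) triples.
# All entries are illustrative placeholders, not clinical facts.
EDGES = [
    ("variant:V1", "ASSOCIATED_WITH", "condition:C1"),
    ("variant:V1", "ASSOCIATED_WITH", "condition:C2"),
    ("condition:C1", "TREATED_BY", "treatment:T1"),
    ("condition:C1", "TREATED_BY", "treatment:T2"),
    ("condition:C2", "TREATED_BY", "treatment:T2"),
    ("variant:V2", "ASSOCIATED_WITH", "condition:C3"),
]


def build_index(edges):
    """Index outgoing edges by (source, label) for direct lookup."""
    index = defaultdict(list)
    for source, label, target in edges:
        index[(source, label)].append(target)
    return index


def treatments_for_variant(index, variant):
    """Follow variant -> condition -> treatment and keep each full path."""
    paths = []
    for condition in index.get((variant, "ASSOCIATED_WITH"), []):
        for treatment in index.get((condition, "TREATED_BY"), []):
            paths.append((variant, condition, treatment))
    return paths


index = build_index(EDGES)
for path in treatments_for_variant(index, "variant:V1"):
    print(" -> ".join(path))
```

Each result is a full path rather than a similarity score, so the answer carries its own justification: which condition links the variant to the treatment.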

ArangoDB's implementation takes this a step further by offering a multi-model database that combines the power of graph, document, and key-value structures in a single platform. This flexibility is particularly valuable in healthcare scenarios where different types of data—structured, semi-structured, and unstructured—must be integrated seamlessly.

The integration of graphs with LLMs adds another dimension. Natural language queries are translated into precise graph traversals, with the results contextualized and presented in human-readable form. Bidirectional translation—from natural language to graph queries and back—makes the system accessible to clinical users without requiring expertise in graph query languages.

Now, let's take a look at a couple of high-impact applications of GraphRAG in personalized medicine.

Application 1: Pharmacogenomics-Based Drug Selection

The Challenge

A patient's response to medications varies dramatically based on their genetic makeup. A drug that works perfectly for one patient might be ineffective or even dangerous for another due to variations in genes that encode drug-metabolizing enzymes, transporters, or target receptors. The field of pharmacogenomics addresses this variability, but implementing its insights in clinical practice remains difficult.

Let’s take Warfarin, a widely used blood thinner, as a prime example of our current data integration challenge. Proper dosing of this medication is critical, with a razor-thin margin between an ineffective dose and one that could cause dangerous bleeding.

Our current systems struggle to efficiently integrate key data:

  • Standard dosing protocols
  • Characteristics of individual patients
  • Genetic markers that can influence drug metabolism

Specifically, variations in two genes, CYP2C9 and VKORC1, could require dose adjustments of up to 80% from standard protocols. Without a system that can automatically flag these genetic variants and calculate adjusted dosing, we're leaving our clinicians to manually juggle complex data sets, increasing both cognitive load and the risk of errors. Similar challenges exist for numerous medications across therapeutic areas.
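
As a rough sketch of what "flag and adjust" could look like in code, the snippet below checks a patient's genotypes against a lookup table and scales a standard dose. Only the two gene names and the 80% ceiling come from the text; the genotype labels, the reduction fractions, the multiplicative combination rule, and the function name are hypothetical placeholders for illustration and are not dosing guidance.

```python
# Hypothetical lookup: fraction by which to reduce the standard dose for a
# given (gene, genotype). Placeholder values for illustration only.
DOSE_REDUCTION = {
    ("CYP2C9", "reduced_function"): 0.30,
    ("VKORC1", "increased_sensitivity"): 0.40,
}

# The text mentions adjustments of up to 80% from standard protocols.
MAX_TOTAL_REDUCTION = 0.80


def flag_and_adjust(standard_dose_mg, genotypes):
    """Return (adjusted_dose_mg, flags) for a dict of gene -> genotype.

    Assumption: reductions combine multiplicatively and are capped at
    MAX_TOTAL_REDUCTION. A real system would follow published guidelines.
    """
    flags = []
    remaining_fraction = 1.0
    for gene, genotype in genotypes.items():
        reduction = DOSE_REDUCTION.get((gene, genotype))
        if reduction is not None:
            flags.append(f"{gene}: {genotype}")
            remaining_fraction *= 1.0 - reduction
    remaining_fraction = max(remaining_fraction, 1.0 - MAX_TOTAL_REDUCTION)
    return round(standard_dose_mg * remaining_fraction, 2), flags


dose, flags = flag_and_adjust(
    5.0, {"CYP2C9": "reduced_function", "VKORC1": "increased_sensitivity"}
)
print(dose, flags)
```

The point of the sketch is the workflow, not the numbers: the variants are flagged explicitly, and the adjustment is computed rather than left to manual cross-referencing.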

Today, healthcare providers face several obstacles when trying to incorporate pharmacogenomic insights. As discussed earlier, the knowledge base is vast and evolves rapidly, with new gene-drug interactions published weekly. The relevant information is scattered across databases, research papers, and clinical guidelines. Interpreting the clinical significance of specific genetic variants also requires highly specialized expertise. Finally, integrating pharmacogenomic data with other clinical factors (age, organ function, co-medications) adds further complexity for clinicians.

Possible Solutions

Several approaches exist to address these challenges:

  1. Standalone pharmacogenomic decision support systems: These specialized tools focus exclusively on gene-drug interactions but often lack integration with broader clinical data
  2. Vector-based RAG systems: These can retrieve relevant literature based on semantic similarity but struggle with the precise relationship mapping needed for pharmacogenomic recommendations
  3. Rule-based expert systems: These encode explicit if-then rules for pharmacogenomic guidelines but are difficult to maintain as knowledge evolves
  4. ArangoDB's GraphRAG-type approach: This combines structured knowledge representation with flexible retrieval and natural language generation

Why ArangoDB GraphRAG Excels for Pharmacogenomics-Based Drug Selection

ArangoDB's GraphRAG offers distinct advantages for pharmacogenomic applications, with capabilities that traditional systems struggle to achieve. Multi-hop reasoning becomes possible: the system can connect genetic variants to enzymes to drugs to alternatives in a single query. Traversing complex relationships in this way is a significant improvement over conventional methods.
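
A multi-hop question of this kind maps naturally onto an AQL graph traversal. The sketch below only builds the query text and its bind variables in Python; the graph name `pgx_knowledge`, the `variants` collection, and the `relation` edge attribute are assumptions made for illustration and do not describe an actual ArangoDB schema.

```python
# AQL traversal that walks exactly three hops outward from a variant:
# variant -> enzyme -> drug -> alternative. Names are hypothetical.
ALTERNATIVES_QUERY = """
FOR v, e, p IN 3..3 OUTBOUND @start GRAPH 'pgx_knowledge'
  RETURN {
    alternative: v._key,
    via: p.vertices[*]._key,
    relations: p.edges[*].relation
  }
"""


def alternatives_request(variant_key):
    """Return the query text and bind variables for one variant."""
    bind_vars = {"start": f"variants/{variant_key}"}
    return ALTERNATIVES_QUERY, bind_vars


query, bind_vars = alternatives_request("example_variant")
print(query)
print(bind_vars)
```

With a driver such as python-arango, the pair could then be passed to `db.aql.execute(query, bind_vars=bind_vars)`. Passing the start vertex as a bind variable, rather than formatting it into the query string, keeps user input out of the query text.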

We are also able to preserve context. In a knowledge graph, the relationships between entities, such as exactly how a genetic variant affects the response to a drug, are explicitly represented. With a graph, we therefore do not lose crucial contextual information during the reasoning process.

Using ArangoDB's GraphRAG also allows us to integrate multiple data types. We can now cohesively query across structured variant data, unstructured clinical guidelines, and semi-structured patient records. This ability to seamlessly work with diverse data formats is crucial in the rather complex landscape of healthcare data management.

Dynamic knowledge updates also become possible with ArangoDB's GraphRAG. In the rapidly evolving field of genetics, novel pharmacogenomic findings emerge weekly. The ArangoDB knowledge graph can be updated without retraining the entire system. This flexibility helps the system stay current with the latest scientific discoveries, providing up-to-date insights for clinical decision-making.
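
The idea that a new finding is a data write rather than a retraining step can be shown with a toy example. The `KnowledgeGraph` class below is a hypothetical in-memory stand-in, not ArangoDB's API, and the gene and drug identifiers are placeholders.

```python
class KnowledgeGraph:
    """Tiny in-memory stand-in for a graph store (illustration only)."""

    def __init__(self):
        self._edges = {}  # (source, relation) -> set of targets

    def add_edge(self, source, relation, target):
        self._edges.setdefault((source, relation), set()).add(target)

    def targets(self, source, relation):
        return sorted(self._edges.get((source, relation), set()))


graph = KnowledgeGraph()
graph.add_edge("gene:G1", "AFFECTS_RESPONSE_TO", "drug:D1")
before = graph.targets("gene:G1", "AFFECTS_RESPONSE_TO")

# A newly published finding arrives: it is recorded as one more edge.
graph.add_edge("gene:G1", "AFFECTS_RESPONSE_TO", "drug:D2")
after = graph.targets("gene:G1", "AFFECTS_RESPONSE_TO")

print(before)  # ['drug:D1']
print(after)   # ['drug:D1', 'drug:D2']
```

The same query returns the new relationship as soon as the edge exists; nothing about the retrieval logic or the language model has to change.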

Imagine a system that transforms how clinicians interact with patient data. Instead of manually cross-referencing multiple databases and guidelines, they could simply ask a question in natural language:

"What antidepressants are recommended for this patient given their CYP2D6 poor metabolizer status?"

A platform built with ArangoDB's GraphRAG would answer this question by querying the knowledge graph with AQL and integrating data from multiple sources: the patient's electronic health record, the hospital's pharmacogenomic database, up-to-date clinical guidelines, and the latest research literature. The result would be a single, comprehensive response.

This response would include a prioritized list of recommended medications, along with an underlying reason for each recommendation, that takes into consideration the patient's specific genetic profile. It would also highlight potential drug interactions based on the patient's current medications and suggest appropriate dosing adjustments.  

All of this information would be presented in a clear, actionable format for the clinician. Prescription errors could go down, the efficacy of therapies could improve, and clinical workflows could be streamlined.

We see a shift from passive data storage to active clinical decision support. This potentially reduces adverse drug events and associated costs while improving patient outcomes. Moreover, this system would be scalable across various medical specialties and adaptable as new genetic insights emerge, providing long-term value for the healthcare organization.

ROI Comparison 

Implementing pharmacogenomic guidance through different approaches yields varying returns: 

  1. Vector-only RAG: Can improve information retrieval but lacks the precision for clear recommendations, resulting in a modest 20-25% improvement in appropriate prescribing.
  2. ArangoDB GraphRAG: By combining precise relationship traversal with natural language interaction, adoption rates rise to 40-60%, with corresponding improvements in outcomes. One healthcare system reported an annual savings of $2.2M after implementing a GraphRAG-based pharmacogenomics approach.

The GraphRAG approach provides clear, contextual guidance that physicians can trust and easily incorporate into their workflow. This gives it a significant ROI advantage over vector-only RAG.

Application 2: Disease Risk Prediction and Prevention

The Challenge 

You need to integrate multiple data types to predict an individual's risk for complex diseases like Alzheimer's, diabetes, cancer, or heart disease. These include genetic risk variants, family history, environmental exposures, lifestyle factors, and biomarker measurements. Traditional risk calculators use simplified models that capture only a fraction of these interactions, while more sophisticated approaches often become "black boxes" that clinicians hesitate to trust. 

The challenges are many: 

  • Risk factors interact in complex, non-linear ways that simple scoring systems can't capture
  • Different risk factors operate on different time scales and with varying degrees of certainty
  • Preventive interventions need to be tailored to the specific combination of risk factors
  • Risk assessments must be explained to clinicians in an understandable way, which is crucial for patient engagement

"We had a patient with a strong family history of breast cancer, but no identifiable BRCA1 or BRCA2 mutation. Her Tyrer-Cuzick risk score was only slightly elevated. But when we looked at her polygenic risk score, incorporating multiple moderate-risk variants, it put her at much higher risk. This case really highlighted for me how our traditional risk models might be missing important genetic contributions to cancer risk." 

                                     - Dr. Judy Garber, Director of the Center for Cancer Genetics and Prevention at Dana-Farber Cancer Institute, speaking at the 2019 San Antonio Breast Cancer Symposium.

Dr. Garber's comment clearly demonstrates the limitations of conventional approaches. 
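
The polygenic risk score mentioned in the quote is, in its most common form, a weighted sum: each variant's effect size multiplied by the number of risk alleles the patient carries (0, 1, or 2). The sketch below shows that arithmetic; the variant identifiers and effect sizes are hypothetical placeholders, not published weights.

```python
def polygenic_risk_score(effect_sizes, allele_counts):
    """Weighted sum of risk-allele counts.

    effect_sizes: dict of variant id -> per-allele effect size (beta)
    allele_counts: dict of variant id -> number of risk alleles (0, 1, or 2)
    Variants missing from allele_counts are treated as 0 copies.
    """
    score = 0.0
    for variant, beta in effect_sizes.items():
        count = allele_counts.get(variant, 0)
        if count not in (0, 1, 2):
            raise ValueError(f"allele count for {variant} must be 0, 1, or 2")
        score += beta * count
    return score


# Placeholder weights and genotype for illustration only.
EFFECT_SIZES = {"variant_a": 0.20, "variant_b": 0.10, "variant_c": 0.05}
PATIENT = {"variant_a": 2, "variant_b": 1}

print(polygenic_risk_score(EFFECT_SIZES, PATIENT))
```

No single variant here is decisive, yet together they add up. That is the pattern the quote describes: many moderate-risk variants that a single-gene test would miss.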

Possible Solutions

We could approach the challenge in different ways: 

  • Statistical risk models: Frameworks like Framingham Risk Score or BOADICEA use statistical methods to combine risk factors, but they handle only a limited set of variables.
  • Machine learning models: These can capture complex interactions but often function as black boxes, making explanation difficult.
  • Vector database approaches: These can retrieve similar cases but struggle to provide the causal reasoning needed to plan the intervention on the patient.
  • ArangoDB GraphRAG-type systems: These actually represent the causal relationships between risk factors, diseases, and interventions, enabling the clinician to both predict and explain.

Why ArangoDB GraphRAG Excels

ArangoDB's GraphRAG approach is uniquely suited to disease risk prediction. For example, you would build a query in AQL that is able to navigate through various types of risk factors, such as genetic risks, lifestyle risks, environmental risks and biomarker risks.

This approach has several advantages. First, the knowledge graph explicitly represents causal relationships between risk factors and diseases, not just correlations, which enables explanation rather than prediction alone.

Second, multi-modal integration becomes possible: genetic, environmental, and clinical data are integrated in a single model that preserves their relationships.

Third, clinicians can plan personalized interventions, because the system can recommend interventions targeted at the specific risk factors identified for an individual patient. Finally, the graph structure allows the system to generate natural language explanations that trace the path from risk factors to disease risk to interventions.
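
Tracing a path into a readable explanation can be sketched in a few lines. The `explain_path` helper and the example triples below are hypothetical; in a full GraphRAG system an LLM would typically phrase the final text, but the underlying chain of relationships would come from the graph in the same way.

```python
def explain_path(path):
    """Turn a chain of (subject, relation, object) triples into one sentence."""
    if not path:
        raise ValueError("path must contain at least one triple")
    # Each triple must start where the previous one ended.
    for (_, _, obj), (next_subject, _, _) in zip(path, path[1:]):
        if obj != next_subject:
            raise ValueError("triples do not form a connected path")
    clauses = [f"{subject} {relation} {obj}" for subject, relation, obj in path]
    return "; ".join(clauses) + "."


# Placeholder path from risk factor to disease to intervention.
PATH = [
    ("risk variant R1", "raises", "biomarker B1"),
    ("biomarker B1", "increases the risk of", "disease D1"),
    ("disease D1", "can be mitigated by", "intervention I1"),
]

print(explain_path(PATH))
```

Because the explanation is assembled from explicit edges, every clause in it can be checked against the underlying data.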

ROI Comparison

Different approaches to disease risk prediction yield varying economic returns, based on implementations and studies: 

  • Traditional risk calculators: These improve risk stratification by 15-20% over clinical judgment alone, leading to modest improvements in preventive care utilization and an ROI of approximately 1.5:1.
  • ML-based models: These can improve prediction accuracy by 25-35% but face adoption challenges because their outputs are hard to explain to clinicians, resulting in an ROI of 2:1 when successfully implemented.
  • Vector-only approaches: These improve information retrieval but struggle with the causal reasoning needed for intervention planning by clinicians, limiting ROI to around 1.8:1.
  • ArangoDB's GraphRAG-type approach: By combining accurate risk prediction with explainable reasoning and targeted intervention recommendations, this approach has demonstrated ROI ratios of 3:1 to 4:1 in new implementations.

The superior ROI of GraphRAG comes from its ability not only to identify who is at risk but also to explain why they are at risk and, more importantly, what specifically can be done about it. Clinicians can then implement preventive interventions tailored to individual risk profiles.

The Future of ArangoDB's GraphRAG in Personalized Medicine

The applications described in this white paper represent just the beginning of what's possible with ArangoDB's GraphRAG technology in personalized medicine. As healthcare continues to generate more data across modalities—genomics, proteomics, metabolomics, digital biomarkers, imaging, and electronic health records—the need for systems that can integrate and reason across these data types will only grow. 

ArangoDB's GraphRAG technology offers a powerful approach to these challenges by combining the strengths of knowledge graphs, vector embeddings, and large language models. The multi-model nature of ArangoDB, together with its query language AQL, is particularly well suited to the heterogeneous data landscape of healthcare, while the integration with natural language processing makes the system accessible to clinical users without specialized technical expertise.

Looking ahead, we can anticipate several trends in the evolution of GraphRAG for personalized medicine:

  • Increasingly automated knowledge graph construction: Tools that can automatically extract entities and relationships from the biomedical literature, reducing the manual curation burden.
  • Multimodal integration: Incorporation of imaging data, sensor readings, and other non-textual modalities into the knowledge graph.
  • Temporal reasoning: Enhanced capabilities for reasoning about changes over time, crucial for understanding disease progression and treatment response.
  • Distributed knowledge graphs: Federation across institutions to enable larger, more comprehensive knowledge structures while preserving privacy and governance.

As these technologies mature, the vision of truly personalized medicine—tailored not just to broad population groups but to each individual's unique biological, clinical, and environmental context—comes closer to reality. GraphRAG technologies like those offered by ArangoDB represent a crucial step toward that future, offering healthcare providers powerful tools to navigate the complexity of human biology and deliver more precise, effective care.


Benchmark Results – ArangoDB vs. Neo4j: ArangoDB up to 8x faster than Neo4j

Introduction

This document presents benchmark results comparing ArangoDB's Graph Analytics Engine (GAE) against Neo4j. The GAE is one component of ArangoDB's Data Science Suite.

This reproducible benchmark aims to provide a neutral and thorough comparison between the two databases, ensuring a fair and unbiased assessment.

We use the wiki-Talk dataset, a widely used, real-world graph dataset derived from the edit and discussion history of Wikipedia.

The wiki-Talk dataset captures communication patterns between Wikipedia users, specifically interactions on user talk pages. It is used frequently in benchmarking graph databases and graph analytics systems because of its characteristics: it is a directed graph of substantial scale (about 2.4 million vertices and 5 million edges), it is sparse, and it has a temporal dimension.

The results demonstrate the efficiency and scalability of each database, and offer a representative benchmark model for organizations evaluating graph databases for their needs.

Benchmark Highlights

The benchmark results reveal several notable insights, particularly highlighting ArangoDB's superior performance in graph analytics tasks compared to Neo4j. Most strikingly:

  • ArangoDB consistently outperformed Neo4j across the graph computation algorithms tested, with speedups on the wiki-Talk dataset ranging from 1.7 times to 8.5 times.
  • The speed advantage is also evident in graph loading, where ArangoDB loaded the wiki-Talk dataset about 1.8 times faster than Neo4j.

ArangoDB's optimized data storage and retrieval, combined with its advanced query execution and effective use of clustered deployments, also contributed significantly to its superior performance in these scenarios.

These findings underscore:

  • ArangoDB's capability to handle much larger-scale and far faster real-time graph analytics applications.
  • ArangoDB as a much more compelling choice for industries and organizations that require rapid data processing and analysis, such as real-time recommendation systems, social network analysis, fraud detection, and cyber security.

Benchmark Overview

Datasets (wiki-Talk)

We utilized the wiki-Talk dataset, a well-regarded dataset for evaluating graph database performance. The chosen graphs and their details are as follows:

Graphs Used     Vertices      Edges
wiki-Talk       2,394,385     5,021,410
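
For a sense of how sparse this graph is, the vertex and edge counts above can be turned into an average out-degree and a density figure. This is a small arithmetic sketch, not part of the benchmark itself; the variable names are ours.

```python
# Counts taken from the dataset table above.
VERTICES = 2_394_385
EDGES = 5_021_410

# Average number of outgoing edges per vertex in the directed graph.
average_out_degree = EDGES / VERTICES

# Share of all possible directed edges (excluding self-loops) that exist.
density = EDGES / (VERTICES * (VERTICES - 1))

print(f"average out-degree: {average_out_degree:.2f}")
print(f"density: {density:.2e}")
```

An average of roughly two outgoing edges per vertex, against millions of possible targets, is what makes compact adjacency representations worthwhile for graphs like this one.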

Hardware

All tests were conducted on the same machine with the following specifications:

          OS        Ubuntu 23.10 (64-bit)
          Memory    192 GB (4800 MHz)
          CPU       Ryzen 9 7950X3D (16 Cores, 32 Threads)

Database Configuration

          Neo4j

          Version       5.19.0 (Community Edition)
          Deployment    On-Premise, Single Process

          ArangoDB

          Version       3.12.0-NIGHTLY.20240305 (Community Edition)
          Deployment    On-Premise, Single Process

          Graph Analytics Engine (GAE)

          Version       Latest
          Deployment    On-Premise, Single Process (Rust-based, no multithreading)

Benchmark Configuration

Two workflows were used to measure performance:

Workflow A:

  1. Create the in-memory representation
  2. Execute each algorithm once
  3. Measure the whole process

Workflow B:

  1. Create the in-memory representation
  2. Measure graph creation time
  3. Execute each algorithm individually
  4. Measure computation time
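
The difference between the two workflows comes down to where the stopwatch starts and stops. The following is a minimal Python sketch of that structure, not the benchmark's actual harness (which, per the next section, uses Vitest with tinybench); `load_graph` and the entries in `algorithms` are hypothetical stand-ins for the real database calls.

```python
import time


def workflow_a(load_graph, algorithms):
    """Time graph creation plus one run of every algorithm as a whole."""
    start = time.perf_counter()
    graph = load_graph()
    for run in algorithms.values():
        run(graph)
    return time.perf_counter() - start


def workflow_b(load_graph, algorithms):
    """Time graph creation and each algorithm separately."""
    start = time.perf_counter()
    graph = load_graph()
    timings = {"load": time.perf_counter() - start}
    for name, run in algorithms.items():
        start = time.perf_counter()
        run(graph)
        timings[name] = time.perf_counter() - start
    return timings


# Stand-ins for the real calls, so the sketch runs on its own.
def fake_load():
    return {"vertices": 3, "edges": [(0, 1), (1, 2)]}


fake_algorithms = {
    "pagerank": lambda graph: sum(range(10_000)),
    "wcc": lambda graph: len(graph["edges"]),
}

print(workflow_a(fake_load, fake_algorithms))
print(workflow_b(fake_load, fake_algorithms))
```

Workflow A yields one end-to-end number, while Workflow B separates loading cost from computation cost, which is what allows the loading and computation results to be reported in separate tables.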

Algorithms Tested

  • PageRank
  • Weakly Connected Components (WCC)
  • Strongly Connected Components (SCC)
  • Label Propagation

Used Technologies

  • JavaScript Framework: Vitest with tinybench
  • Communication
    • Neo4j: Official Neo4j JS driver ("neo4j-driver": "^5.18.0")
    • GAE: Plain HTTPS requests using Axios ("axios": "^1.6.8")

Benchmark Results

Graph Loading (wiki-Talk)

Task                                    GAE (sec)   Neo4j (sec)   Times Faster
Load graph wiki-Talk                    9.9         18            1.8 x
Load Graph wiki-Talk with Attributes    10.7        19.2          1.8 x
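
The "Times Faster" column is the ratio of the two timings, Neo4j seconds divided by GAE seconds. A short sketch of that arithmetic, using the loading figures from the table above (the dictionary and function names are ours):

```python
# (GAE seconds, Neo4j seconds) from the graph loading table above.
LOADING_SECONDS = {
    "Load graph wiki-Talk": (9.9, 18.0),
    "Load Graph wiki-Talk with Attributes": (10.7, 19.2),
}


def times_faster(gae_seconds, neo4j_seconds):
    """How many times faster the GAE finished, rounded to one decimal."""
    return round(neo4j_seconds / gae_seconds, 1)


for task, (gae, neo4j) in LOADING_SECONDS.items():
    print(f"{task}: {times_faster(gae, neo4j)} x")
```

Because the published timings are themselves rounded, ratios recomputed this way can differ slightly from figures derived from the unrounded measurements.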

Graph Computation (wiki-Talk)

Task                         GAE (sec)   Neo4j (sec)   Times Faster
Compute PageRank             3.8         10.6          2.8 x
Compute WCC                  2.3         4.5           1.7 x
Compute SCC                  3.2         6.7           2.1 x
Compute Label Propagation    1.5         13            8.5 x

Explanation of Elements

Graph Algorithms

  • PageRank: Ranks nodes in a graph based on their connections; also commonly used in search engines.
  • Weakly Connected Components (WCC): Identifies subsets of a graph in which any two vertices are connected by a path when edge direction is ignored.
  • Strongly Connected Components (SCC): Identifies subsets of a graph in which every vertex is reachable from every other vertex in the same subset, following edge direction.
  • Label Propagation: A semi-supervised algorithm for community detection in graphs, in which nodes iteratively propagate their labels to their neighbors.
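
The difference between weakly and strongly connected components is easiest to see on a tiny directed graph. The sketch below uses textbook techniques (union-find for WCC, Kosaraju's algorithm for SCC) on a four-node example; it illustrates the definitions only and says nothing about how the GAE or Neo4j implement these algorithms.

```python
from collections import defaultdict


def weakly_connected_components(nodes, edges):
    """Group nodes with union-find, ignoring edge direction."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(group) for group in groups.values())


def strongly_connected_components(nodes, edges):
    """Kosaraju: record DFS finish order, then DFS on the reversed graph."""
    forward, backward = defaultdict(list), defaultdict(list)
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)

    visited, order = set(), []

    def visit(n):
        visited.add(n)
        for m in forward[n]:
            if m not in visited:
                visit(m)
        order.append(n)

    for n in nodes:
        if n not in visited:
            visit(n)

    assigned, components = set(), []

    def collect(n, component):
        assigned.add(n)
        component.append(n)
        for m in backward[n]:
            if m not in assigned:
                collect(m, component)

    for n in reversed(order):
        if n not in assigned:
            component = []
            collect(n, component)
            components.append(sorted(component))
    return sorted(components)


NODES = ["a", "b", "c", "d"]
EDGES = [("a", "b"), ("b", "a"), ("b", "c"), ("d", "d")]

print(weakly_connected_components(NODES, EDGES))    # [['a', 'b', 'c'], ['d']]
print(strongly_connected_components(NODES, EDGES))  # [['a', 'b'], ['c'], ['d']]
```

Node `c` belongs to the same weak component as `a` and `b` because an edge touches it, but it forms its own strong component because nothing leads back from `c`. The recursive traversal is fine for a toy graph; graphs with millions of vertices would need an iterative version.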

Reasons for ArangoDB’s Superior Performance

Several factors contribute to ArangoDB's superior performance:

The performance of ArangoDB on the wiki-Talk dataset is attributable to specific architectural optimizations rather than to raw computational power. In this scenario, ArangoDB serves as the data storage system, while the computation is handled by the Graph Analytics Engine (GAE). The benchmark focuses on two key stages:

  1. Loading the data into the GAE
  2. Computation of algorithms within the GAE

Graph Loading Times

ArangoDB Side

ArangoDB’s graph loading times are optimized due to two primary factors:

  1. Parallel Data Extraction: ArangoDB's support for parallel data loading from both single-server and distributed deployments is a major reason for its data loading performance. This capability lets teams scale to multiple machines, where increased parallelism yields faster data transfer. By enabling efficient horizontal scaling, the system achieves significant performance improvements over approaches that are limited to sequential extraction.
  2. Projections for Targeted Data Transfer: Projections allow ArangoDB to transmit only the data attributes required for analysis. If only edge IDs and a single attribute are needed, the system extracts and transfers only these fields, avoiding the overhead of transmitting entire documents. This reduces both the data volume and network latency during graph loading operations.
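
Both ideas can be illustrated with a small, self-contained sketch: documents are split into shards, each shard is processed by its own worker, and only the requested attributes are kept. The function names, the shard layout, and the sample edge documents are hypothetical and do not reflect ArangoDB's internal implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def project(document, fields):
    """Keep only the attributes the analysis needs."""
    return {field: document[field] for field in fields if field in document}


def extract_shard(shard, fields):
    """Project every document in one shard."""
    return [project(document, fields) for document in shard]


def parallel_extract(shards, fields, max_workers=4):
    """Extract all shards concurrently and return one flat list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda shard: extract_shard(shard, fields), shards)
        return [document for shard_docs in results for document in shard_docs]


# Two hypothetical shards of edge documents with extra attributes.
SHARDS = [
    [{"_id": "e/1", "_from": "v/1", "_to": "v/2", "weight": 0.5, "note": "x" * 50}],
    [{"_id": "e/2", "_from": "v/2", "_to": "v/3", "weight": 0.9, "note": "y" * 50}],
]

print(parallel_extract(SHARDS, fields=("_id", "weight")))
```

The long `note` attribute never leaves the "storage" side, which is the essence of a projection: less data to serialize, transfer, and parse.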

Graph Analytics Engine (GAE) Side

The GAE is built using Rust, and it processes the transferred data with high efficiency:

  • Efficient Data Representation
    The GAE stores graph data within highly optimized in-memory structures, reducing memory usage while at the same time maintaining extremely fast access speeds. Graphs are immediately ready for computation without unnecessary delays.
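
The source does not say which in-memory layout the GAE uses. As an illustration of what a compact graph representation can look like, the sketch below builds a compressed sparse row (CSR) structure, a common choice for analytics workloads; treating it as the GAE's actual format would be an assumption.

```python
def to_csr(num_vertices, edges):
    """Build CSR arrays for a directed graph with integer vertex ids.

    The out-neighbors of vertex v are targets[offsets[v]:offsets[v + 1]].
    """
    offsets = [0] * (num_vertices + 1)
    for source, _ in edges:
        offsets[source + 1] += 1          # count out-edges per vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]      # prefix sum gives slice bounds

    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()          # next free slot for each vertex
    for source, target in edges:
        targets[cursor[source]] = target
        cursor[source] += 1
    return offsets, targets


def neighbors(offsets, targets, v):
    """Return the out-neighbors of vertex v."""
    return targets[offsets[v]:offsets[v + 1]]


offsets, targets = to_csr(3, [(0, 1), (0, 2), (2, 0), (1, 2)])
print(offsets)   # [0, 2, 3, 4]
print(targets)   # [1, 2, 2, 0]
print(neighbors(offsets, targets, 0))  # [1, 2]
```

Two flat arrays replace per-vertex objects and pointer chasing, which is why layouts of this kind tend to be both small and fast to scan.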

Advantages in the Workflow

These features deliver several tangible benefits, as shown during the benchmark:

  1. Fast and Parallel Data Extraction - Parallelism improves speed and scalability. 
  2. Optimized Data Transfer with Projections - Only the required data is transmitted, minimizing overhead. 
  3. Compact and Efficient In-Memory Representation in the GAE - High-performance graph computation with minimal memory footprint.

Clarifying the Benchmark Scope

It is important to note that the benchmark does not evaluate data insertion times into ArangoDB or computational tasks performed by ArangoDB itself. Instead, it assesses the efficiency of:

  • Loading graph data from ArangoDB into the GAE.
  • The GAE's ability to compute graph algorithms.

By highlighting these stages, the benchmark shows the advantages of ArangoDB’s design in supporting large-scale graph workflows through fast data loading and efficient interaction with the GAE.

Reproducibility of the Benchmark

This benchmark is 100% reproducible, ensuring consistent and verifiable results. These results reflect ArangoDB’s implementation per the precise specifications and configurations mentioned above. We welcome organizations to replicate the benchmark to ensure consistent results. To do this, follow these steps:

  1. First, set up the hardware environment with an Ubuntu 23.10 operating system, 192 GB of memory, and a Ryzen 9 7950X3D CPU.
  2. Install and configure the latest versions of Neo4j and ArangoDB using the provided Docker configurations. Use single-threaded (non-clustered) configurations for both.
  3. Next, utilize the wiki-Talk dataset for testing. Execute the specified graph algorithms (PageRank, WCC, SCC, Label Propagation) using the detailed workflows (A and B) outlined in the benchmark configuration above.
  4. Measure the in-memory graph creation and computation times, and compare the results for both databases. This method ensures that the benchmark can be reliably reproduced in different environments.

PLEASE NOTE: This benchmark requires the installation of the ArangoDB Graph Analytics Engine (GAE). As this code is not open source, please reach out to Corey Sommers at corey.sommers@arangodb.com to receive access to the GAE for the purposes of reproducing this benchmark in your environment (to ensure objectivity of results).

Conclusion

The benchmark results clearly demonstrate ArangoDB's far superior performance over Neo4j in the categories of graph computation and loading tasks. ArangoDB's significant speed advantages - particularly its ability to execute complex algorithms and load large datasets much faster - highlight its optimized architecture and efficient data handling.

These findings make ArangoDB a compelling choice for applications requiring high-performance graph analytics and real-time data processing.


Vector Search in ArangoDB: Practical Insights and Hands-On Examples

Estimated reading time: 5 minutes

Vector search is gaining traction as a go-to tool for handling large, unstructured datasets like text, images, and audio. It works by comparing vector embeddings, numerical representations generated by machine learning models, to find items with similar properties. With the integration of Facebook’s FAISS library, ArangoDB brings scalable, high-performance vector search directly into its core, accessible via AQL (ArangoDB Query Language). Vector search is now just another fully integrated data type/model in ArangoDB’s multi-model approach. The Vector Search capability is currently in Developer Preview and is slated for production release in Q1 2025.

This guide will walk you through setting up vector search, combining it with graph traversal for advanced use cases, and using tools like LangChain to power natural language queries that integrate Vector Search and GraphRAG.
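To make the idea of comparing embeddings concrete, here is a minimal, brute-force sketch in plain Python. It is not how FAISS or ArangoDB implement vector search (those rely on dedicated index structures); the tiny three-dimensional vectors and the names `EMBEDDINGS`, `cosine_similarity`, and `top_k` are assumptions made purely for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], items: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every item against the query and return the k most similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings; real ones come from a machine learning model.
EMBEDDINGS = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], EMBEDDINGS, k=2)
print(results)
```

A linear scan like this is fine for a handful of vectors; at scale, an approximate index is what keeps lookups fast.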


Some Perspectives on HybridRAG in an ArangoDB World

Estimated reading time: 7 minutes

Introduction

Graph databases continue to gain momentum, thanks to their knack for handling intricate relationships and context. Developers and tech leaders are seeing the potential of pairing them with the creative strength of large language models (LLMs). This combination is opening the door to more precise, context-aware answers to natural language prompts. That’s where RAG comes in—it pulls in useful information, whether from raw text (VectorRAG) or a structured knowledge graph (GraphRAG), and feeds it into the LLM. The result? Smarter, more relevant responses that are grounded in actual data.


ArangoDB vs. Neo4J

Estimated reading time: 7 minutes

Update: https://arangodb.com/2023/10/evolving-arangodbs-licensing-model-for-a-sustainable-future/

Last October the first iteration of this blog post explained an update to ArangoDB’s 10-year-old license model. Thank you for providing feedback and suggestions. As mentioned, we will always remain committed to our community and hence today, we are happy to announce yet another update that integrates your feedback.

Your ArangoDB Team

ArangoDB as a company is firmly grounded in Open Source. The first commit was made in October 2011, and today we're very proud of having over 13,000 stargazers on GitHub. The ArangoDB community should be able to enjoy all of the benefits of using ArangoDB, and we have always offered a completely free community edition in addition to our paid enterprise offering.

With the evolving landscape of database technologies and the imperative to ensure ArangoDB remains sustainable, innovative, and competitive, we’re introducing some changes to our licensing model. These alterations will help us continue our commitment to the community, fuel further cutting-edge innovations and development, and assist businesses in obtaining the best from our platform. These alterations are based on changes in the broader database market.

Upcoming Changes

The changes to the licensing are in two primary areas:

  1. Distribution and Managed Services
  2. Commercial Use of Community Edition

Distribution and Managed Services

Effective with version 3.12 of ArangoDB, the source code will be licensed under the BSL 1.1 instead of the existing Apache 2.0 license. This applies to 3.12 and all future versions.

BSL 1.1 is a source-available license that has three core tenets, some of which are customizable and specified by each licensor:   

  1. BSL v.1.1 will always allow copying, modification, redistribution, non-commercial use, and commercial use in a non-production context. 
  2. By default, BSL does not allow for production use unless the licensor provides a limited right as an “Additional Use Grant”; this piece is customizable and explained below. 
  3. BSL provides a Change Date, usually between one and four years out, on which the BSL license converts to a Change License that is open source, such as the GNU General Public License (GPL), the GNU Affero General Public License (AGPL), or Apache.

ArangoDB has defined our Additional Use Grant to allow BSL-licensed ArangoDB source code to be deployed for any purpose (e.g. production) as long as you are not (i) creating a commercial derivative work or (ii) offering or including it in a commercial product, application, or service (e.g. commercial DBaaS, SaaS, Embedded or Packaged Distribution/OEM). We have set the Change Date to four (4) years, and the Change License to Apache 2.0.

These changes will not impact the majority of those currently using the ArangoDB source code, but they will protect ArangoDB against larger companies providing a competing service using our source code or monetizing ArangoDB by embedding or distributing the ArangoDB software.

As an example, if you use the ArangoDB source code, create derivative works of software based on ArangoDB, and build and package the binaries yourself, you are free to use the software for commercial purposes as long as it is not a SaaS, DBaaS, or OEM distribution. You cannot use the Community Edition prepackaged binaries for any of the purposes mentioned above.

Commercial Use of Community Edition

We are also making changes to our Community Edition, the prepackaged ArangoDB binaries available for free on our website. Where before this edition was governed by the same Apache 2.0 license as the source code, it will now be governed by a new ArangoDB Community License, which limits commercial use of the Community Edition in production to a dataset size of 100 GB within a single cluster and to a maximum of three clusters.

Commercial use describes any activity in which you use a product or service for financial gain. This includes any use of the software to support your customers or products, since that software is used for business purposes with the intent of increasing sales or supporting customers. This explicitly does not apply to non-profit organizations.

As an example, suppose you deploy software in production that uses ArangoDB as a database, the database size is under 100 GB per cluster, and the deployment is limited to a maximum of three clusters within an organization. Even though the software is used commercially, you have no commercial obligation to ArangoDB because the deployment falls within the allowed limits. Similarly, non-production deployments such as QA, Test, and Dev using the Community Edition create no commercial obligations to ArangoDB.

Our Enterprise Edition will continue to be governed by the existing ArangoDB Enterprise License.

What should Community users do?

The license changes will roll out and be effective with the release of 3.12 slated for the end of Q1 2024, and there will be no immediate impact to any releases prior to 3.12. Once the license changes are fully applied, there will be a few impacts:

  • If you are using Community Edition or Source Code for your managed service (DBaaS, SaaS), you will be unable to do so for future versions of ArangoDB starting with version 3.12.
  • If you are using Community Edition or Source Code and distributing it to your customers along with your software, you will be unable to do so for future versions of ArangoDB starting with version 3.12.
  • If you are using the Community Edition for commercial purposes in any production deployment that stores more than 100 GB of data per cluster, has more than three clusters, or both, you are required to have a commercial agreement with ArangoDB starting with version 3.12.

If any of these apply to you and you want to avoid future disruption, we encourage you to contact us so that we can work with you to find a commercially acceptable solution for your business.

How is ArangoDB easing the transition for community users with this change?

ArangoDB is willing to make concessions for community users to help them with the transition and the license change. Our shared goal is both to enable ArangoDB to continue commercially as the primary developer of the Community Edition (CE) and to allow our CE users to have successful deployments that meet their business and commercial goals. Support from ArangoDB and ongoing help with your deployments from our Customer Success Team allow us to maintain the quality of deployments and, ultimately, deliver a more satisfying experience for users.

We do not intend to create hardship for the community users and are willing to discuss reasonable terms and conditions for commercial use.

ArangoDB can offer two solutions to meet your commercial use needs:

  1. Enterprise License: Provide a full-fledged enterprise license for your commercial use with all the enterprise features along with Enterprise SLA and Support.
  2. Community Transition: We do not intend to create hardship for the community users and hence created a 'CE Transition Fund', which can be allocated by mutual discussion to ease the transition. This will allow us to balance the value that CE brings to an organization and the Support/Features available.

Summary

Our commitment to open-source ideals remains unshaken. Adjusting our model is essential to ensure ArangoDB’s longevity and to provide you with the cutting-edge features you expect from us. We continue to uphold our vision of an inclusive, collaborative, and innovative community. This change ensures we can keep investing in our products and you, our valued community.

Frequently Asked Questions

1. Does this affect the commercially packaged editions of your software such as Arango Enterprise Edition, and ArangoGraph Insights Platform? 

No, this only affects ArangoDB source code and ArangoDB Community Edition. 

2. Whom does this change primarily impact?

This has no effect on most paying customers, as they already license ArangoDB under a commercial license. This change also has no effect on users who use ArangoDB for non-commercial purposes. This change affects open-source users who are using ArangoDB for commercial purposes and/or distributing and monetizing ArangoDB with their software.

3. Why change now?

ArangoDB 3.12 is a breakthrough release that includes improved performance, resilience, and memory management. These highly appealing design changes may motivate third parties to fork ArangoDB source code in order to create their own commercial derivative works without giving back to the developer community. We feel it is in the best interest of the community and our customers to avoid that outcome. 

4. In four years, after the Change Date, can I make my own commercial product from ArangoDB 3.12 source code under Apache 2.0?

Yes, if you desire.

5. Is ArangoDB still an Open Source company?

Yes. While the BSL 1.1 is not an official open source license approved by the Open Source Initiative (OSI), we still license a large amount of source code under an open source license such as our Drivers, Kube-Arango Operator, Tools/Utilities, and we continue to host ArangoDB-related open source projects.  Furthermore, the BSL only restricts the use of our source code if you are trying to commercialize it. Finally, after four years, the source code automatically converts to an OSI-approved license (Apache 2.0). 

6. How does the license change impact other products, specifically the kube-arango operator?

There are two versions of the kube-arango operator: the Community and the Enterprise versions. At this time there are no plans to change licensing terms for the operator. The operator will, however, automatically enforce the licensing depending upon the ArangoDB version under management (enterprise or community).


ArangoDB 3.12 – Performance for all Your Data Models

Estimated reading time: 6 minutes

We are proud to announce the GA release of ArangoDB 3.12!

Congrats to the team and community for the latest ArangoDB release 3.12! ArangoDB 3.12 is focused on greatly improving performance and observability both for the core database and our search offering. In this blog post, we will go through some of the most important changes to ArangoDB and give you an idea of how this can be utilized in your products.


Advanced Fraud Detection in Financial Services with ArangoDB and AQL

Estimated reading time: 3 minutes

Advanced Fraud Detection: ArangoDB’s AQL vs. Traditional RDBMS

In the realm of financial services, where fraud detection is both critical and complex, the choice of database and query language can impact the efficiency and effectiveness of fraud detection systems. Let’s explore how ArangoDB – a multi-model graph database – is powered by AQL (ArangoDB Query Language) to handle multiple, real-world fraud detection scenarios in a much more seamless and powerful way compared to traditional Relational Database Management Systems (RDBMS).


Update: Evolving ArangoDB’s Licensing Model for a Sustainable Future

Estimated reading time: 7 minutes

Updated 3/28/25 for accuracy.

Last October the first iteration of this blog post explained an update to ArangoDB’s 10-year-old license model. Thank you for providing feedback and suggestions. As mentioned, we will always remain committed to our community and hence today, we are happy to announce yet another update that integrates your feedback.

Your ArangoDB Team

ArangoDB as a company is firmly grounded in Open Source. The first commit was made in October 2011, and today we're very proud of having over 13,000 stargazers on GitHub. The ArangoDB community should be able to enjoy all of the benefits of using ArangoDB, and we have always offered a completely free community edition in addition to our paid enterprise offering.

With the evolving landscape of database technologies and the imperative to ensure ArangoDB remains sustainable, innovative, and competitive, we’re introducing some changes to our licensing model. These alterations will help us continue our commitment to the community, fuel further cutting-edge innovations and development, and assist businesses in obtaining the best from our platform. These alterations are based on changes in the broader database market.

Upcoming Changes

The changes to the licensing are in two primary areas:

  1. Distribution and Managed Services
  2. Commercial Use of Community Edition

Distribution and Managed Services

Effective with version 3.12 of ArangoDB, the source code will be licensed under the BSL 1.1 instead of the existing Apache 2.0 license. This applies to 3.12 and all future versions.

BSL 1.1 is a source-available license that has three core tenets, some of which are customizable and specified by each licensor:   

  1. BSL v.1.1 will always allow copying, modification, redistribution, non-commercial use.
  2. By default, BSL does not allow for production use unless the licensor provides a limited right as an “Additional Use Grant”; this piece is customizable and explained below. 
  3. BSL provides a Change Date, usually between one and four years out, on which the BSL license converts to a Change License that is open source, such as the GNU General Public License (GPL), the GNU Affero General Public License (AGPL), or Apache.

ArangoDB has defined our Additional Use Grant to allow BSL-licensed ArangoDB source code to be deployed for any purpose (e.g. production) as long as you are not (i) creating a commercial derivative work or (ii) offering or including it in a commercial product, application, or service (e.g. commercial DBaaS, SaaS, Embedded or Packaged Distribution/OEM). We have set the Change Date to four (4) years, and the Change License to Apache 2.0.

These changes will not impact the majority of those currently using the ArangoDB source code, but they will protect ArangoDB against larger companies providing a competing service using our source code or monetizing ArangoDB by embedding or distributing the ArangoDB software.

As an example, if you use the ArangoDB source code, create derivative works of software based on ArangoDB, and build and package the binaries yourself, you are free to use the software for commercial purposes as long as it is not a SaaS, DBaaS, or OEM distribution. You cannot use the Community Edition prepackaged binaries for any of the purposes mentioned above.

What should Community users do?

The license changes will roll out and be effective with the release of 3.12 slated for the end of Q1 2024, and there will be no immediate impact to any releases prior to 3.12. Once the license changes are fully applied, there will be a few impacts:

  • If you are using Community Edition or Source Code for your managed service (DBaaS, SaaS), you will be unable to do so for future versions of ArangoDB starting with version 3.12.
  • If you are using Community Edition or Source Code and distributing it to your customers along with your software, you will be unable to do so for future versions of ArangoDB starting with version 3.12.
  • If you are using the Community Edition for commercial purposes you are required to have a commercial agreement with ArangoDB starting with version 3.12.

If any of these apply to you and you want to avoid future disruption, we encourage you to contact us so that we can work with you to find a commercially acceptable solution for your business.

How is ArangoDB easing the transition for community users with this change?

ArangoDB is willing to make concessions for community users to help them with the transition and the license change. Our shared goal is both to enable ArangoDB to continue commercially as the primary developer of the Community Edition (CE) and to allow our CE users to have successful deployments that meet their business and commercial goals. Support from ArangoDB and ongoing help with your deployments from our Customer Success Team allow us to maintain the quality of deployments and, ultimately, deliver a more satisfying experience for users.

We do not intend to create hardship for the community users and are willing to discuss reasonable terms and conditions for commercial use.

ArangoDB can offer two solutions to meet your commercial use needs:

  1. Enterprise License: Provide a full-fledged enterprise license for your commercial use with all the enterprise features along with Enterprise SLA and Support.
  2. Community Transition: We do not intend to create hardship for the community users and hence created a 'CE Transition Fund', which can be allocated by mutual discussion to ease the transition. This will allow us to balance the value that CE brings to an organization and the Support/Features available.

Summary

Adjusting our model is essential to ensure ArangoDB’s longevity and to provide you with the cutting-edge features you expect from us. We continue to uphold our vision of an inclusive, collaborative, and innovative community. This change ensures we can keep investing in our products and you, our valued community.

Frequently Asked Questions

1. Does this affect the commercially packaged editions of your software such as Arango Enterprise Edition, and ArangoGraph Insights Platform? 

No, this only affects ArangoDB source code and ArangoDB Community Edition. 

2. Whom does this change primarily impact?

This has no effect on most paying customers, as they already license ArangoDB under a commercial license. This change also has no effect on users who use ArangoDB for non-commercial purposes. This change affects Community Edition users who are using ArangoDB for commercial purposes and/or distributing and monetizing ArangoDB with their software.

3. Why change now?

ArangoDB 3.12 is a breakthrough release that includes improved performance, resilience, and memory management. These highly appealing design changes may motivate third parties to fork ArangoDB source code in order to create their own commercial derivative works without giving back to the developer community. We feel it is in the best interest of the community and our customers to avoid that outcome. 

4. Is ArangoDB still an Open Source company?

Yes. While the BSL 1.1 is not an official open source license approved by the Open Source Initiative (OSI), we still license a large amount of source code under an open source license such as our Drivers, Kube-Arango Operator, Tools/Utilities, and we continue to host ArangoDB-related open source projects.  Furthermore, the BSL only restricts the use of our source code if you are trying to commercialize it. Finally, after four years, the source code automatically converts to an OSI-approved license (Apache 2.0). 

5. How does the license change impact other products, specifically the kube-arango operator?

There are two versions of the kube-arango operator: the Community and the Enterprise versions. At this time there are no plans to change licensing terms for the operator. The operator will, however, automatically enforce the licensing depending upon the ArangoDB version under management (enterprise or community).

 


The world is a graph: How Fix reimagines cloud security using a graph in ArangoDB

'Guest Blog'

Estimated reading time: 5 minutes

In 2015, John Lambert, a Corporate Vice President and Security Fellow at Microsoft, wrote: “Defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.”

The original problem in cloud security is visibility into my assets. If security engineers don’t know what cloud services are running, they can’t protect an environment. Unfortunately, first-generation cloud security products were built with a list mindset, i.e. “rows and columns”. They generate a list of assets and their configurations, but show no context of the relationships between connected cloud services, such as a connection that would allow lateral movement between two disparate cloud assets.

Cloud security as a graph

A graph database like ArangoDB provides a powerful way to represent and analyze complex relationships in cloud security.

A graph is the easiest way to understand how one entity in my cloud interacts with another. By representing cloud assets as nodes in a graph and the relationships between them as edges, I can now gain a better understanding of the nested connections in my cloud infrastructure.

By thinking about cloud resources in terms of ancestors and descendants, a cloud security engineer can solve problems in a way a table canʼt. The graph is an easier way to visualize the relationships between users and any of my cloud resources such as compute instances, functions, storage buckets and databases.

  • Ancestors: The graph helps me understand the root of a security issue. What is the highest ancestor where an issue was introduced? Because I need to go all the way up and fix the problem at its origin.
  • Descendants: The other way around is understanding descendants and blast radius. If I have an Internet-exposed compute instance, where an attacker is maybe able to get credentials off that instance, how many hops can that attacker go in? How much of my infrastructure is exposed due to this initial compromise?
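The ancestor and descendant questions above both reduce to a graph traversal. The sketch below uses a plain breadth-first search over a tiny in-memory graph; it is not Fix's or ArangoDB's implementation, and the resource names in `EDGES` are invented for illustration.

```python
from collections import deque
from typing import Dict, List

# Hypothetical cloud graph: each key points to its direct successors.
EDGES: Dict[str, List[str]] = {
    "aws_account": ["vpc_1"],
    "vpc_1": ["instance_1"],
    "instance_1": ["iam_role_1"],
    "iam_role_1": ["s3_bucket_1", "db_1"],
}


def reachable(start: str, edges: Dict[str, List[str]]) -> Dict[str, int]:
    """Breadth-first search: map each reachable node to its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in edges.get(node, []):
            if successor not in hops:
                hops[successor] = hops[node] + 1
                queue.append(successor)
    del hops[start]  # the start node is not part of its own result
    return hops


def reverse(edges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Flip every edge so a traversal walks towards ancestors."""
    flipped: Dict[str, List[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


# Descendants: what can be reached from a compromised instance?
blast_radius = reachable("instance_1", EDGES)
# Ancestors: walk upwards from an affected bucket to find the origin.
ancestors = reachable("s3_bucket_1", reverse(EDGES))
highest_ancestor = max(ancestors, key=ancestors.get)

print(blast_radius)
print(highest_ancestor)
```

The hop counts answer "how many hops can an attacker go in?", and reversing the edges turns the same function into an ancestor lookup.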

In a cloud-native world, these graph traversal capabilities are fundamental for cloud security. Going forward, any operating model for cloud security should be built on a graph. With Fix, weʼre building such a modern cloud security tool, and weʼre building it with ArangoDB.

But first, a list!

Now that we covered the benefits of using a graph for cloud security, letʼs start with a list. Yes, a list – because sometimes, viewing my cloud assets in a graph might not be the most intuitive or useful thing.

For example, I may just want a list of my compute instance inventory across my AWS accounts. As a cloud security engineer, I want a baseline inventory of resources. I don’t really need a picture for that, I just want the list. And maybe I want to download it in a spreadsheet so I can slice and dice it, with metadata for each particular instance like create date, number of vCPUs and memory. A list is the best way to represent that information.

But if a list is enough, why collect data in a graph in the first place?

Because transformation from a graph to a table is trivial. The other way around, not so much. A graph lets you express things that, with the same data in flat tables, would become intractable: many different tables, foreign key relationships, and all kinds of joins all over the place. It just becomes too difficult to reason about.
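As a rough illustration of why the graph-to-table direction is the easy one, the sketch below flattens graph nodes of one kind into rows and writes them as CSV. The node fields (`vcpus`, `memory_gb`) and the helper names are assumptions for this example, not Fix's actual data model.

```python
import csv
import io
from typing import Dict, List

# Hypothetical graph nodes; nodes of different kinds carry different fields.
NODES: List[Dict[str, object]] = [
    {"id": "i-1", "kind": "compute_instance", "vcpus": 2, "memory_gb": 8},
    {"id": "i-2", "kind": "compute_instance", "vcpus": 4, "memory_gb": 16},
    {"id": "b-1", "kind": "storage_bucket"},
]


def to_table(nodes: List[Dict[str, object]], kind: str, columns: List[str]) -> List[Dict[str, object]]:
    """Keep nodes of one kind and project them onto fixed columns."""
    return [
        {column: node.get(column, "") for column in columns}
        for node in nodes
        if node["kind"] == kind
    ]


def to_csv(rows: List[Dict[str, object]], columns: List[str]) -> str:
    """Serialize the rows as CSV text, ready for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


COLUMNS = ["id", "vcpus", "memory_gb"]
rows = to_table(NODES, "compute_instance", COLUMNS)
csv_text = to_csv(rows, COLUMNS)
print(csv_text)
```

Going the other way would mean reconstructing the relationships between rows, which is exactly the information a flat list drops.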

The hard part is collecting data from cloud APIs and putting it into a graph form. Thatʼs much harder, takes time and is easy to get wrong. There are enough opportunities to make mistakes along the way, and create a representation thatʼs not correct or has bugs. Thatʼs why we believe transparency in how a cloud security product collects data matters. Both ArangoDB and Fix are open source. Our code shows how we collect and store data from cloud APIs in ArangoDB.

Graph-based analysis of cloud resources

The analysis layer of a graph is powerful because it can provide insights that tables cannot. One recent trend in security is that software engineers also take on security engineering tasks. They look after the security of their infrastructure, beyond infrastructure-as-code templates.

While Fix offers out-of-the-box visualizations and pre-built checks of compliance rules, we’ve also built a search syntax on top of the ArangoDB Query Language (AQL). With ArangoDB and AQL, I can store and query rich, nested JSON-like documents together with their vertices. It’s also easier to add metadata to the vertices and query it, such as configuration data for a cloud resource. By building our syntax on top of AQL, we’ve made Fix human-friendly. Developers can easily run ad-hoc checks of the security posture of their infrastructure.

For example, activating flow logs in your VPCs is considered a security best practice by AWS. The search below finds all AWS VPCs where flow logs are deactivated.

 is(aws_vpc) with(empty, --> is(aws_ec2_flow_log))

Breaking it down, the search:

  • first, finds all resources of the kind “aws_vpc”, no matter in which account or region they may run.
  • then, filters for the VPCs without a direct relationship (successor) to an “aws_ec2_flow_log” resource.

A simple one-line statement.

The same query expressed in SQL would require joining different tables with nested select statements, multiple where-clauses and case statements. It would be dozens of lines long and require an engineer to have knowledge of the table architecture and column names.
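The two steps of the flow-log search can be mimicked on a small in-memory graph. This is a hedged sketch of the logic only, not how Fix or AQL evaluates the search; the resource IDs and the function name `without_successor_of_kind` are made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical resources (id -> kind) and directed edges (source, target).
NODES: Dict[str, str] = {
    "vpc-a": "aws_vpc",
    "vpc-b": "aws_vpc",
    "fl-1": "aws_ec2_flow_log",
    "i-1": "aws_ec2_instance",
}
EDGES: List[Tuple[str, str]] = [("vpc-a", "fl-1"), ("vpc-b", "i-1")]


def without_successor_of_kind(
    nodes: Dict[str, str],
    edges: List[Tuple[str, str]],
    kind: str,
    successor_kind: str,
) -> List[str]:
    """Find nodes of `kind` with no direct successor of `successor_kind`."""
    matches = []
    for node_id, node_kind in nodes.items():
        if node_kind != kind:
            continue  # step 1: keep only resources of the requested kind
        has_successor = any(
            source == node_id and nodes[target] == successor_kind
            for source, target in edges
        )
        if not has_successor:  # step 2: keep those lacking the successor
            matches.append(node_id)
    return matches


vpcs_without_flow_logs = without_successor_of_kind(
    NODES, EDGES, "aws_vpc", "aws_ec2_flow_log"
)
print(vpcs_without_flow_logs)
```

Here `vpc-a` has a flow log attached and is filtered out, while `vpc-b` only has an instance as a successor and is reported.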

The power of a graph is that it lets you explore many-to-many relationships in a very easy way, in a way that a traditional row-based database just canʼt. By making security data from cloud resources available in a graph, software engineers with security responsibilities can gain visibility into the environment and reduce risks.

A graph provides context, context is king

The partnership between Fix and the ArangoDB team has brought our customers new security insights only made possible by the multi-dimensional relations of cloud resources stored in a graph. With ArangoDB, using graphs is no longer a complex computer science and operational challenge. For Fix, ArangoDB provides a graph database as a building block that makes it easy to store and query the relationships in your data.

Fix uses ArangoDB to analyze billions of relationships in every cloud. With ArangoDB, we’ve been able to build a system that can ingest data at scale. One of our retail users ingests data from tens of thousands of cloud accounts in minutes, and then runs any type of analytics in a fraction of a second. The context of the graph helps security engineers to precisely answer questions and identify, prioritize and remediate risks: the “trifecta” of cloud security.

The precision, speed, and explainability of finding risks to your business is simply not possible without using a graph. When defenders can think in graphs, attackers lose.

 


Reintroducing the ArangoDB-RDF Adapter

Estimated reading time: 1 minute

ArangoRDF allows you to export Graphs from ArangoDB into RDFLib, the standard library for working with Resource Description Framework (RDF) in Python, and vice-versa.

