Vector Search in ArangoDB: Practical Insights and Hands-On Examples

Vector search is gaining traction as a go-to tool for handling large, unstructured datasets like text, images, and audio. It works by comparing vector embeddings, the numerical representations generated by machine learning models, to find items with similar properties. With the integration of Facebook’s FAISS library, ArangoDB brings scalable, high-performance vector search directly into its core, accessible via AQL (ArangoDB Query Language). Vector search is now one more fully integrated data model in ArangoDB’s multi-model approach.

This guide will walk you through setting up vector search, combining it with graph traversal for advanced use cases, and using tools like LangChain to power natural language queries that integrate Vector Search and GraphRAG.

What is Vector Search and Why Does it Matter?

Vector search lets you find similar items in a dataset by comparing their embeddings. Embeddings are essentially compact numerical representations of data—like a fingerprint for each data point—that capture its essence in a way that machine learning models can process.

For instance:

  • Text: An embedding might represent the meaning of a sentence.
  • Images: An embedding might capture the general appearance of an object.
  • Audio: An embedding might represent the rhythm or tone of a sound.

Traditional search methods like keyword matching or exact lookups can’t handle the subtle relationships captured by embeddings. Vector search fills that gap, finding semantically similar results even if the original data is complex or unstructured.
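To make "comparing embeddings" concrete, here is a minimal Python sketch of cosine similarity, the measure used in the examples later in this guide. The three-dimensional vectors are invented for illustration; real embeddings come from a model and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related concepts point in similar directions, so they score higher.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

A keyword search would see no overlap between "cat" and "kitten"; the embeddings place them close together because the vectors point in nearly the same direction.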

Setting Up Vector Search in ArangoDB

To get started, you need to create a vector index, which makes searching embeddings fast and efficient.

Step 1: Create a Vector Index in arangosh

Imagine you have a collection called items where each document includes a vector embedding stored in the attribute vector_data. To set up an index using arangosh:

db.items.ensureIndex({
  name: "vector_cosine",
  type: "vector",
  fields: ["vector_data"],
  params: { metric: "cosine", dimension: 128, nLists: 100 }
});

Explanation:

  • type: "vector" specifies that this is a vector index.
  • dimension: 128 indicates the size of the embeddings. Every stored vector and every query vector must have this length.
  • metric: "cosine" defines the similarity measurement (l2, the Euclidean distance, is another option).
  • nLists: 100 defines the number of clusters used in the index.

This step prepares the collection for efficient similarity searches.
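To build some intuition for nLists, the following toy Python sketch shows how a cluster-based index can narrow a search: vectors are grouped by their nearest centroid, and a query only scans the group it falls into. This is an illustration of the general idea, not ArangoDB's or FAISS's actual code. The centroids are given by hand here (a real index learns them from the data), the function names are hypothetical, and the sketch uses Euclidean distance for simplicity.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda i: squared_l2(vector, centroids[i]))

def build_lists(vectors, centroids):
    """Assign every vector key to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for key, vec in vectors.items():
        lists[nearest_centroid(vec, centroids)].append(key)
    return lists

def search(query, vectors, centroids, lists, k):
    """Scan only the list the query falls into, then rank those candidates."""
    cell = nearest_centroid(query, centroids)
    candidates = lists[cell]
    return sorted(candidates, key=lambda key: squared_l2(query, vectors[key]))[:k]

# Two hand-picked centroids play the role of nLists = 2.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [0.1, 0.2], "b": [0.3, 0.1], "c": [9.8, 10.1], "d": [10.2, 9.9]}
lists = build_lists(vectors, centroids)

print(lists)
print(search([9.9, 10.0], vectors, centroids, lists, k=1))
```

More lists mean fewer vectors to scan per query, but a higher chance that a true neighbor sits in a list that is never visited. That trade-off is why the search is called approximate.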

Now, let’s query the items collection to find the five most similar embeddings to a query vector [0.1, 0.3, 0.5, ...]:

LET query = [0.1, 0.3, 0.5, ...]
FOR doc IN items
  LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
  SORT score DESC
  LIMIT 5
  RETURN {doc, score}

Explanation:

  • score: The cosine similarity between the query vector and the document’s vector, a number between -1 and 1. The closer the score is to 1, the more similar the document is to the query.
  • APPROX_NEAR_COSINE: Compares the vector_data attribute of each document with the query vector using Approximate Nearest Neighbor search based on cosine similarity.
  • SORT: Orders the results by similarity score, descending.
  • LIMIT: Restricts the results to the top 5 matches.
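In an application, the query vector usually comes from an embedding model at runtime, so it is passed as a bind variable rather than written into the query text. The sketch below assumes the python-arango driver and the items collection and 128-dimensional index from above; the helper name similarity_bind_vars is hypothetical. The driver call is left as a comment so that the helper can be tried without a running server.

```python
# AQL text with bind variables for the query vector and the result count.
SIMILARITY_AQL = """
FOR doc IN items
  LET score = APPROX_NEAR_COSINE(doc.vector_data, @query)
  SORT score DESC
  LIMIT @k
  RETURN {doc, score}
"""

def similarity_bind_vars(query_vector, k=5, dimension=128):
    """Validate the inputs and return the bind variables for SIMILARITY_AQL."""
    if len(query_vector) != dimension:
        raise ValueError(f"expected {dimension} dimensions, got {len(query_vector)}")
    if k < 1:
        raise ValueError("k must be at least 1")
    return {"query": [float(x) for x in query_vector], "k": k}

bind_vars = similarity_bind_vars([0.0] * 128)

# With a python-arango database handle `db` (an assumption, not created here):
# cursor = db.aql.execute(SIMILARITY_AQL, bind_vars=bind_vars)
# for row in cursor:
#     print(row["doc"]["_key"], row["score"])
```

Checking the dimension before sending the query catches a mismatched embedding model early, on the client side.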

Combining Vector Search with Graph Traversals

One of ArangoDB’s strengths is its multi-model capability, which allows you to combine vector search with graph traversal. For example, in a fraud detection scenario, you might:

  1. Use vector search to find similar case descriptions.
  2. Traverse the graph to uncover related entities (e.g., linked transactions or individuals).

Example: Vector Search + Graph Traversal

LET query = [0.1, 0.3, 0.5, ...]
FOR doc IN items
  LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
  SORT score DESC
  LIMIT 5
  LET related_nodes = (
    FOR v, e, p IN 1..2 ANY doc GRAPH 'fraud_graph'
      RETURN v
  )
  RETURN {doc, score, related_nodes}

Explanation:

  • The first part of the query finds documents similar to the query vector (just like the example above).
  • The LET sub-query performs a graph traversal on the results, fetching nodes related to each document.
  • The final RETURN combines the document, similarity score, and its related graph data.
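The traversal step can be pictured with a small in-memory sketch in Python. It collects every vertex within one to two hops of a start vertex, following edges in either direction, which is roughly what 1..2 ANY expresses. The vertex IDs and edges are invented, and unlike an AQL traversal (which enumerates paths and may return a vertex more than once) this sketch returns each vertex a single time.

```python
from collections import deque

def neighbors_within(start, edges, min_depth=1, max_depth=2):
    """Vertices reachable from start in min_depth..max_depth hops, ignoring edge direction."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the maximum depth
        for nxt in sorted(adjacency.get(vertex, ())):
            if nxt not in seen:
                seen.add(nxt)
                if depth + 1 >= min_depth:
                    found.append(nxt)
                queue.append((nxt, depth + 1))
    return found

# Hypothetical fraud graph: cases link to transactions, transactions to people.
edges = [
    ("case/1", "tx/10"),
    ("tx/10", "person/7"),
    ("person/7", "tx/11"),
    ("case/2", "tx/11"),
]

# Pretend vector search returned one hit with its similarity score.
hits = [("case/1", 0.93)]
results = [
    {"doc": doc_id, "score": score, "related_nodes": neighbors_within(doc_id, edges)}
    for doc_id, score in hits
]
print(results)
```

Here tx/11 and case/2 are three and four hops away from case/1, so they are left out. Widening the depth range pulls in more context at the cost of larger results.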

GraphRAG: Combining Vector Search and Knowledge Graphs

GraphRAG (Graph-based Retrieval-Augmented Generation) combines vector search with knowledge graphs to enhance natural language query handling. It retrieves both semantically similar results (using vector embeddings) and highly structured insights (via graph traversal), making it ideal for use cases like law enforcement, fraud detection, and advanced recommendation systems.

How to Implement GraphRAG with ArangoDB

  1. Store Embeddings and Relationships
    • Store vector embeddings in Document Collections (either Vertices or Edges).
    • Organize Entities and their Relationships in the Graph.
  2. Set Up Query Pipeline
    • Use Vector Search to find semantically similar items.
    • Traverse the Graph to uncover related Entities and Relationships.
  3. Combine with an LLM (Large Language Model)
    • Use the results to provide context for an LLM, enabling more accurate and context-aware responses.
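As a rough sketch of step 3, the function below turns query results shaped like the {doc, score, related_nodes} objects returned by the AQL example into a text prompt for an LLM. The prompt wording, the text attribute, and the function name build_prompt are assumptions made for illustration; the LLM call itself is left out.

```python
def build_prompt(question, hits, max_hits=3):
    """Format vector-search hits and their graph neighbors as context for an LLM."""
    lines = ["Answer the question using only the context below.", "", "Context:"]
    for hit in hits[:max_hits]:
        doc = hit["doc"]
        lines.append(f"- {doc['_id']} (similarity {hit['score']:.2f}): {doc['text']}")
        for node in hit["related_nodes"]:
            lines.append(f"  related: {node['_id']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

# Hypothetical result of the combined vector + graph query.
hits = [
    {
        "doc": {"_id": "case/1", "text": "Phishing campaign targeting bank customers"},
        "score": 0.93,
        "related_nodes": [{"_id": "tx/10"}, {"_id": "person/7"}],
    }
]

prompt = build_prompt("Which transactions are linked to phishing cases?", hits)
print(prompt)
```

Capping the number of hits keeps the prompt within the model's context window; how many to include depends on the model and on how long the documents are.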

Natural Language Querying with LangChain

To allow users to query using natural language, you can integrate LangChain with ArangoDB. LangChain converts a user’s natural language input into structured AQL queries. Here’s how you might implement this:

Step 1: Define the Workflow

  • User inputs a query like: "Find cases similar to Cybercrime A1425 and show related transactions."
  • LangChain processes the query to understand its intent and structure.
  • The tool generates an AQL query combining vector search and graph traversal.

Step 2: Example LangChain Integration

from arango import ArangoClient
from langchain_openai import ChatOpenAI
from langchain_community.graphs import ArangoGraph
from langchain_community.chains.graph_qa.arangodb import ArangoGraphQAChain

# Initialize ArangoDB connection (uses the default _system database)
client = ArangoClient("http://localhost:8529")
db = client.db(username="root", password="test")

# Select the LLM of choice
llm = ChatOpenAI(temperature=0, model_name="gpt-4")

# Define the natural language interface chain.
# Note: recent langchain_community releases may also require passing
# allow_dangerous_requests=True here, because the chain executes generated AQL.
chain = ArangoGraphQAChain.from_llm(llm=llm, graph=ArangoGraph(db))

# Invoke the chain interface
response = chain.invoke("find cases similar to Cybercrime A1425 & their related transactions")
print(response)

Explanation:

  • LangChain generates the AQL query based on the user’s input.
  • The generated query could combine vector search and graph traversal, as shown earlier.
  • The result is sent back to the user as structured insights.

Why Combine Vector Search with Graph?

By pairing vector search with graph traversal, you get the best of both worlds:

  • Vector Search: Excels at retrieving semantically similar, unstructured data.
  • Graph Traversal: Shines when exploring structured relationships.

For instance:

  • In fraud detection, vector search finds similar cases, while graph traversal uncovers linked transactions and actors.
  • In law enforcement, vector search identifies relevant documents, and graph traversal maps connections between suspects.

HybridGraphRAG extends the power of ArangoDB by combining three advanced retrieval mechanisms: vector search, graph traversal, and full-text search. This hybrid approach ensures you can handle complex, multi-dimensional queries that involve both semantic similarity and structured data relationships.

Why Use HybridGraphRAG?

When combining these technologies, you can:

  • Retrieve semantically similar documents using vector search.
  • Explore relationships between entities through graph traversal.
  • Match specific keywords or phrases using full-text search.

This approach is ideal for applications like fraud detection, law enforcement, or personalized recommendation systems where structured and unstructured data complement each other.


How to Implement HybridGraphRAG in AQL

Let’s walk through an example where you:

  1. Use full-text search to find documents mentioning "cybercrime" and "threat".
  2. Use vector search to retrieve documents similar to the query embedding.
  3. Use graph traversals to find relationships between the retrieved documents.

Combined Query Example:

LET query = [0.1, 0.3, 0.5, ...]
LET text_matches = (
  FOR doc IN itemsView
    SEARCH PHRASE(doc.text, ["cybercrime", "threat"], "text_en")
    RETURN doc._id
)
FOR doc IN items
  FILTER doc._id IN text_matches
  LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
  SORT score DESC
  LIMIT 5
  LET related_nodes = (
    FOR v, e, p IN 1..2 ANY doc GRAPH 'fraud_graph'
      RETURN v
  )
  RETURN {doc, score, related_nodes}

Explanation:

  • itemsView: An ArangoDB View that indexes doc.text for full-text search.
  • SEARCH PHRASE(...): Matches documents whose text contains "cybercrime" followed by "threat" as a phrase, tokenized with the text_en Analyzer.
  • APPROX_NEAR_COSINE: Compares each doc.vector_data with the query vector using Approximate Nearest Neighbor search based on cosine similarity.
  • LET related_nodes: Fetches the 1-to-2-hop neighborhood of the matching documents.
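The first two stages of the hybrid query (text filter, then vector ranking) can be mimicked in memory with a few lines of Python. This is a simplified stand-in, not what the View does internally: it splits text on whitespace instead of using the text_en Analyzer, it requires every keyword to appear anywhere in the text rather than as a phrase, and the documents and two-dimensional vectors are invented. The graph step is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(docs, keywords, query_vector, k):
    """Keep docs whose text contains every keyword, then rank them by cosine similarity."""
    wanted = [w.lower() for w in keywords]
    matches = [d for d in docs if all(w in d["text"].lower().split() for w in wanted)]
    ranked = sorted(matches, key=lambda d: cosine(d["vector_data"], query_vector), reverse=True)
    return [d["_id"] for d in ranked[:k]]

# Hypothetical documents with tiny embeddings.
docs = [
    {"_id": "items/1", "text": "Cybercrime threat report", "vector_data": [1.0, 0.0]},
    {"_id": "items/2", "text": "cybercrime threat against banks", "vector_data": [0.6, 0.8]},
    {"_id": "items/3", "text": "quarterly sales report", "vector_data": [1.0, 0.0]},
]

print(hybrid_search(docs, ["cybercrime", "threat"], [1.0, 0.1], k=2))
```

Note that items/3 has the same embedding as items/1 but is dropped by the keyword filter. The text stage decides which documents are eligible, and the vector stage only orders them.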

Conclusion

ArangoDB’s vector search, powered by FAISS, is more than a standalone feature—it’s a force multiplier for combining advanced data science techniques with graph-based insights. Whether you’re implementing natural language interfaces with LangChain or building hybrid query pipelines for real-world problems, the integration of vector search into ArangoDB’s multi-model system opens up endless possibilities.

Get started by creating your vector index, crafting AQL queries, and exploring what’s possible when you blend vectors with graphs. The tools are ready—now it’s up to you to build something amazing.


ArangoDB vs. Neo4J

Estimated reading time: 7 minutes

Update: https://arangodb.com/2023/10/evolving-arangodbs-licensing-model-for-a-sustainable-future/

Last October the first iteration of this blog post explained an update to ArangoDB’s 10-year-old license model. Thank you for providing feedback and suggestions. As mentioned, we will always remain committed to our community and hence today, we are happy to announce yet another update that integrates your feedback.

Your ArangoDB Team

ArangoDB as a company is firmly grounded in Open Source. The first commit was made in October 2011, and today we're very proud of having over 13,000 stargazers on GitHub. The ArangoDB community should be able to enjoy all of the benefits of using ArangoDB, and we have always offered a completely free community edition in addition to our paid enterprise offering.

With the evolving landscape of database technologies and the imperative to ensure ArangoDB remains sustainable, innovative, and competitive, we’re introducing some changes to our licensing model. These alterations will help us continue our commitment to the community, fuel further cutting-edge innovations and development, and assist businesses in obtaining the best from our platform. These alterations are based on changes in the broader database market.

Upcoming Changes

The changes to the licensing are in two primary areas:

  1. Distribution and Managed Services
  2. Commercial Use of Community Edition

Distribution and Managed Services

Effective with version 3.12, the ArangoDB source code will be licensed under the Business Source License (BSL) 1.1 instead of the existing Apache 2.0 license. This applies to 3.12 and all future versions.

BSL 1.1 is a source-available license that has three core tenets, some of which are customizable and specified by each licensor:   

  1. BSL v.1.1 will always allow copying, modification, redistribution, non-commercial use, and commercial use in a non-production context. 
  2. By default, BSL does not allow for production use unless the licensor provides a limited right as an “Additional Use Grant”; this piece is customizable and explained below. 
  3. BSL provides a Change Date, usually one to four years out, on which the BSL license converts to a Change License that is open source, such as the GNU General Public License (GPL), the GNU Affero General Public License (AGPL), or Apache.

ArangoDB has defined our Additional Use Grant to allow BSL-licensed ArangoDB source code to be deployed for any purpose (e.g. production) as long as you are not (i) creating a commercial derivative work or (ii) offering or including it in a commercial product, application, or service (e.g. commercial DBaaS, SaaS, Embedded or Packaged Distribution/OEM). We have set the Change Date to four (4) years, and the Change License to Apache 2.0.

These changes will not impact the majority of those currently using the ArangoDB source code but will protect ArangoDB against larger companies providing a competing service using our source code or monetizing ArangoDB by embedding/distributing the ArangoDB software.

As an example, if you use the ArangoDB source code, create derivative works of software based on ArangoDB, and build/package the binaries yourself, you are free to use the software for commercial purposes as long as it is not a SaaS, DBaaS, or OEM distribution. You cannot use the Community Edition prepackaged binaries for any of the purposes mentioned above.

Commercial Use of Community Edition

We are also making changes to our Community Edition, the prepackaged ArangoDB binaries available for free on our website. Where before this edition was governed by the same Apache 2.0 license as the source code, it will now be governed by a new ArangoDB Community License, which limits commercial use of the Community Edition to a dataset size of 100 GB in production within a single cluster and to a maximum of three clusters.

Commercial use describes any activity in which you use a product or service for financial gain. This includes whenever you use software to support your customers or products,  since that software is used for business purposes with the intent of increasing sales or supporting customers. This explicitly does not apply to non-profit organizations.

As an example, suppose you deploy software in production that uses ArangoDB as a database, the database size is under 100 GB per cluster, and the deployment is limited to a maximum of three clusters within your organization. Even though the software is used commercially, you have no commercial obligation to ArangoDB because the deployment falls under the allowed limits. Similarly, non-production deployments such as QA, Test, and Dev using the Community Edition create no commercial obligations to ArangoDB.

Our Enterprise Edition will continue to be governed by the existing ArangoDB Enterprise License.

What should Community users do?

The license changes will roll out and be effective with the release of 3.12 slated for the end of Q1 2024, and there will be no immediate impact to any releases prior to 3.12. Once the license changes are fully applied, there will be a few impacts:

  • If you are using Community Edition or Source Code for your managed service (DBaaS, SaaS), you will be unable to do so for future versions of ArangoDB starting with version 3.12.
  • If you are using Community Edition or Source Code and distributing it to your customers along with your software, you will be unable to do so for future versions of ArangoDB starting with version 3.12.
  • If you are using the Community Edition for commercial purposes in any production deployment that stores more than 100 GB of data per cluster, has more than three clusters, or both, you are required to have a commercial agreement with ArangoDB starting with version 3.12.

If any of these apply to you and you want to avoid future disruption, we encourage you to contact us so that we can work with you to find a commercially acceptable solution for your business.

How is ArangoDB easing the transition for community users with this change?

ArangoDB is willing to make concessions for community users to help them with the transition and the license change. Our shared goal is both to enable ArangoDB to continue commercially as the primary developer of the Community Edition (CE) and to allow our CE users to have successful deployments that meet their business and commercial goals. Support from ArangoDB and ongoing help with your deployments (from our Customer Success Team) allow us to maintain the quality of deployments and, ultimately, to provide a more satisfying experience for users.

We do not intend to create hardship for the community users and are willing to discuss reasonable terms and conditions for commercial use.

ArangoDB can offer two solutions to meet your commercial use needs:

  1. Enterprise License: Provide a full-fledged enterprise license for your commercial use with all the enterprise features along with Enterprise SLA and Support.
  2. Community Transition: We have created a 'CE Transition Fund', which can be allocated by mutual discussion to ease the transition. This will allow us to balance the value that CE brings to an organization and the Support/Features available.

Summary

Our commitment to open-source ideals remains unshaken. Adjusting our model is essential to ensure ArangoDB’s longevity and to provide you with the cutting-edge features you expect from us. We continue to uphold our vision of an inclusive, collaborative, and innovative community. This change ensures we can keep investing in our products and you, our valued community.

Frequently Asked Questions

1: Does this affect the commercially packaged editions of your software, such as ArangoDB Enterprise Edition and ArangoGraph Insights Platform?

No, this only affects ArangoDB source code and ArangoDB Community Edition. 

2: Whom does this change primarily impact?

This has no effect on most paying customers, as they already license ArangoDB under a commercial license. This change also has no effect on users who use ArangoDB for non-commercial purposes. This change affects open-source users who are using ArangoDB for commercial purposes and/or distributing and monetizing ArangoDB with their software.

3: Why change now?

ArangoDB 3.12 is a breakthrough release that includes improved performance, resilience, and memory management. These highly appealing design changes may motivate third parties to fork ArangoDB source code in order to create their own commercial derivative works without giving back to the developer community. We feel it is in the best interest of the community and our customers to avoid that outcome. 

4: In four years, after the Change Date, can I make my own commercial product from ArangoDB 3.12 source code under Apache 2.0?  

Yes, if you desire.

5: Is ArangoDB still an Open Source company?

Yes. While the BSL 1.1 is not an official open source license approved by the Open Source Initiative (OSI), we still license a large amount of source code under open source licenses, such as our Drivers, the Kube-Arango Operator, and Tools/Utilities, and we continue to host ArangoDB-related open source projects. Furthermore, the BSL only restricts the use of our source code if you are trying to commercialize it. Finally, after four years, the source code automatically converts to an OSI-approved license (Apache 2.0).

6: How does the license change impact other products, specifically the kube-arango operator?

There are two versions of the kube-arango operator: the Community and the Enterprise versions. At this time there are no plans to change licensing terms for the operator. The operator will, however, automatically enforce the licensing depending upon the ArangoDB version under management (enterprise or community).


ArangoDB 3.12 – Performance for all Your Data Models

Estimated reading time: 6 minutes

We are proud to announce the GA release of ArangoDB 3.12!

Congrats to the team and community for the latest ArangoDB release 3.12! ArangoDB 3.12 is focused on greatly improving performance and observability both for the core database and our search offering. In this blog post, we will go through some of the most important changes to ArangoDB and give you an idea of how this can be utilized in your products.


Advanced Fraud Detection in Financial Services with ArangoDB and AQL

Estimated reading time: 3 minutes

Advanced Fraud Detection: ArangoDB’s AQL vs. Traditional RDBMS

In the realm of financial services, where fraud detection is both critical and complex, the choice of database and query language can impact the efficiency and effectiveness of fraud detection systems. Let’s explore how ArangoDB – a multi-model graph database – is powered by AQL (ArangoDB Query Language) to handle multiple, real-world fraud detection scenarios in a much more seamless and powerful way compared to traditional Relational Database Management Systems (RDBMS).


Who’s Who in Data Science

Estimated reading time: 10 minutes

Multiple data science personas participate in the daily operations of data logistics and intelligent business applications. Management and employees need to understand the big picture of data science to maximize collaboration efforts for these operations. This article will highlight the specialized roles and skillsets needed for the different data science tasks and the best tools to empower data-driven teams. You will come away from this article with a better understanding of how to support your own data science teams, and it is valuable for both managers and team members alike.


Community Notebook Challenge

Calling all Community Members! 🥑

Today we are excited to announce our Community Notebook Challenge.

What is our Notebook Challenge you ask? Well, this blog post is going to catch you up to speed and get you excited to participate and have the chance to win the grand prize: a pair of custom Apple Airpod Pros.


Sort-Limit Optimization in AQL

Sometimes we want sorted output from a query and, for whatever reason, cannot use an index to do the sorting. In ArangoDB, we already cover this critical case with finely tuned query execution code. Sometimes though, we do not need to return all output, and follow our SORT clause with LIMIT. In ArangoDB 3.4 and earlier, we did not handle this case any differently from returning the full data, at least with respect to sorting – we would sort the full input, then apply the limit afterwards.
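The excerpt above does not say how later versions handle SORT followed by LIMIT. As a general illustration only, and not a description of ArangoDB's implementation, one common way to avoid sorting the full input is to keep just the best candidates in a bounded heap. The Python sketch below contrasts the two approaches on invented rows.

```python
import heapq

def sort_then_limit(rows, key, limit):
    """Sort the full input, then cut it down: O(n log n) work regardless of limit."""
    return sorted(rows, key=key)[:limit]

def bounded_top(rows, key, limit):
    """Keep only the `limit` smallest rows while scanning: roughly O(n log limit)."""
    return heapq.nsmallest(limit, rows, key=key)

# Hypothetical rows with distinct sort keys.
rows = [
    {"name": "c", "price": 30},
    {"name": "a", "price": 10},
    {"name": "d", "price": 40},
    {"name": "b", "price": 20},
]

by_price = lambda row: row["price"]
print(sort_then_limit(rows, by_price, 2))
print(bounded_top(rows, by_price, 2))
```

Both functions return the same rows; the difference is how much work and memory they need when the input is large and the limit is small.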


Using the WebUI AQL Editor – Basics

The ArangoDB query language (AQL) can be used to retrieve and modify data that is stored in ArangoDB. The AQL editor in the web interface is useful for running ad hoc AQL queries and trying things out.

The editor is split into three parts. The center section allows you to write your query and modify your query bind parameters. At the bottom you can either run the query or explain it to inspect its execution plan. This can be used to check whether the query uses indexes, and which ones.


From Zero to Advanced Graph Query Knowledge with ArangoDB


Arangochair – a tool for listening to changes in ArangoDB

The ArangoDB team gave me an opportunity to write a tutorial about arangochair. Arangochair is the first attempt to listen for changes in the database and execute actions like pushing a document to the client or executing an AQL query. Currently it is limited to single nodes.

This tutorial is loosely based on the example at baslr/arangochair-serversendevents-demo

arangochair is a Node.js module hosted on npm, which makes it fairly easy to install. Just run npm install arangochair and it is installed.
