Vector Search in ArangoDB: Practical Insights and Hands-On Examples

Vector search is gaining traction as a go-to tool for handling large, unstructured datasets like text, images, and audio. It works by comparing vector embeddings (numerical representations generated by machine learning models) to find items with similar properties. With the integration of Facebook’s FAISS library, ArangoDB brings scalable, high-performance vector search directly into its core, accessible via AQL (ArangoDB Query Language). Vector search is now another fully integrated data model in ArangoDB’s multi-model approach. The vector search capability is currently in Developer Preview and is scheduled for production release in Q1 2025.

This guide will walk you through setting up vector search, combining it with graph traversal for advanced use cases, and using tools like LangChain to power natural language queries that integrate Vector Search and GraphRAG.

What is Vector Search and Why Does it Matter?

Vector search lets you find similar items in a dataset by comparing their embeddings. Embeddings are compact numerical representations of data (like a fingerprint for each data point) that capture its essence in a way that machine learning models can process.

For instance:

  • Text: An embedding might represent the meaning of a sentence.
  • Images: An embedding might capture the general appearance of an object.
  • Audio: An embedding might represent the rhythm or tone of a sound.

Traditional search methods like keyword matching or exact lookups can’t handle the subtle relationships captured by embeddings. Vector search fills that gap, finding semantically similar results even if the original data is complex or unstructured.
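To make "comparing embeddings" concrete, here is a minimal sketch in plain Python of an exact, brute-force similarity search using cosine similarity. The three-dimensional vectors and the item names are made up for illustration; real embedding models produce vectors with hundreds of dimensions, and a vector index avoids scoring every item the way this loop does.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def most_similar(query, items, k=2):
    """Exact (brute-force) top-k search: score every item, keep the best k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings, not produced by any real model.
items = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.1, 0.9],
}

top = most_similar([1.0, 0.0, 0.0], items, k=2)
print(top)
```

The two "animal" vectors point in nearly the same direction as the query, so they rank above "truck" even though no keywords are involved.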

Setting Up Vector Search in ArangoDB

To get started, you need to create a vector index, which makes searching embeddings fast and efficient.

Step 1: Create a Vector Index

Imagine you have a collection called items where each document includes a vector embedding stored in the attribute vector_data. To set up an index using arangosh:

db.items.ensureIndex({
    name: "vector_cosine",
    type: "vector",
    fields: ["vector_data"],
    params: { metric: "cosine", dimension: 128, nLists: 100 }
});

Explanation:

  • type: “vector” specifies this is a vector index.
  • dimension: 128 indicates the size of the embeddings.
  • metric: “cosine” defines the similarity measurement (another can be l2).
  • nLists: 100 defines the number of clusters used in the index. This parameter is subject to change in the future for improved performance & user experience.

This step prepares the collection for efficient similarity searches.
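To give some intuition for what a "number of clusters" parameter such as nLists controls, the following is a toy sketch of the general inverted-list idea: vectors are grouped by their nearest centroid, and a query scans only the group nearest to it. This is a conceptual illustration under simplifying assumptions (hand-picked centroids, a single list scanned, squared L2 distance), not ArangoDB's or FAISS's actual implementation; all names and data are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: squared_l2(vec, centroids[i]))

def build_lists(vectors, centroids):
    """Assign every vector to the list (cluster) of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for key, vec in vectors.items():
        lists[nearest_centroid(vec, centroids)].append(key)
    return lists

def search(query, vectors, centroids, lists, k=1):
    """Approximate search: scan only the list whose centroid is nearest the query."""
    candidates = lists[nearest_centroid(query, centroids)]
    return sorted(candidates, key=lambda key: squared_l2(query, vectors[key]))[:k]

# Hypothetical 2-D data with two hand-picked centroids (so the list count is 2).
vectors = {"a": [0.1, 0.2], "b": [0.2, 0.1], "c": [0.9, 0.8], "d": [0.8, 0.9]}
centroids = [[0.0, 0.0], [1.0, 1.0]]
lists = build_lists(vectors, centroids)
hits = search([0.85, 0.8], vectors, centroids, lists, k=1)
print(lists, hits)
```

More lists mean fewer vectors scanned per query, at the cost of possibly missing a true neighbor that landed in a different list. That trade-off is why such searches are called approximate.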

Now, let’s query the items collection to find the five embeddings most similar to a query vector. The query vector is passed as the bind parameter @query, which could be set to an embedding like [0.1, 0.3, 0.5, …]:

FOR doc IN items
    LET score = APPROX_NEAR_COSINE(doc.vector_data, @query)
    SORT score DESC
    LIMIT 5
    RETURN {doc, score}   

Explanation:

  • score: The approximate cosine similarity between the query vector and the document’s vector. The closer the score is to 1, the more similar the document is to the query.
  • APPROX_NEAR_COSINE: Compares the vector_data attribute of each document with the query vector using Approximate Nearest Neighbor search based on cosine similarity.
  • SORT: Orders the results by similarity score, descending.
  • LIMIT: Restricts the results to the top 5 matches.
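As a sketch of how the query above might be run from application code, the helper below passes the embedding as the @query bind variable through the python-arango driver's `db.aql.execute` call. The function name `find_similar` is hypothetical, and `db` is assumed to be a database handle obtained from `ArangoClient`; the query vector should have the same length as the index's `dimension` (128 in the index defined earlier).

```python
TOP_K_QUERY = """
FOR doc IN items
    LET score = APPROX_NEAR_COSINE(doc.vector_data, @query)
    SORT score DESC
    LIMIT 5
    RETURN {doc, score}
"""

def find_similar(db, query_vector):
    """Run the top-5 query, passing the embedding as the @query bind variable.

    `db` is assumed to be a python-arango database handle, e.g.
    ArangoClient(hosts="http://localhost:8529").db(username=..., password=...).
    """
    if not query_vector:
        raise ValueError("query_vector must not be empty")
    cursor = db.aql.execute(TOP_K_QUERY, bind_vars={"query": list(query_vector)})
    # The cursor is iterable; materialize it into a list of {doc, score} dicts.
    return list(cursor)
```

Using a bind variable rather than string formatting keeps the (potentially long) vector out of the query text.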

Combining Vector Search with Graph Traversals

One of ArangoDB’s strengths is its multi-model capability, which allows you to combine vector search with graph traversal. For example, in a fraud detection scenario, you might:

  1. Use vector search to find similar case descriptions.
  2. Traverse the graph to uncover related entities (e.g., linked transactions or individuals).

Example: Vector Search + Graph Traversal

Let @query be set to an embedding like [0.1, 0.3, 0.5, …]:

FOR doc IN items
    LET score = APPROX_NEAR_COSINE(doc.vector_data, @query)
    SORT score DESC
    LIMIT 5
    LET related_nodes = (
        FOR v, e, p IN 1..2 ANY doc GRAPH 'fraud_graph'
            RETURN v
    )
    RETURN {doc, score, related_nodes}

Explanation:

  • The first part of the query finds documents similar to the query vector (just like the example above).
  • The LET sub-query performs a graph traversal on the results, fetching nodes related to each document.
  • The final RETURN combines the document, similarity score, and its related graph data.
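To illustrate what a `1..2 ANY` traversal collects, here is a small in-memory sketch in Python: a breadth-first search that follows edges in either direction and keeps the vertices found one or two hops from the start. The edge list and vertex names are hypothetical. As a simplification, each vertex is returned once, whereas an AQL traversal with default options may return the same vertex once per path that reaches it.

```python
from collections import deque

def neighbors_within(start, edges, min_depth=1, max_depth=2):
    """Vertices reachable from start in min_depth..max_depth hops, ignoring direction."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)  # ANY: follow edges both ways

    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue  # do not expand beyond the maximum depth
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in depth_of:
                depth_of[nxt] = depth_of[node] + 1
                queue.append(nxt)
    return sorted(v for v, d in depth_of.items() if min_depth <= d <= max_depth)

# Hypothetical fraud graph: a case, its transactions, and the people involved.
edges = [
    ("case/1", "txn/10"),
    ("txn/10", "person/7"),
    ("person/7", "txn/11"),
    ("person/9", "case/1"),
]
related = neighbors_within("case/1", edges)
print(related)
```

Here `txn/11` is three hops from `case/1`, so it falls outside the `1..2` range, while `person/9` is included even though its edge points toward the case.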

GraphRAG: Combining Vector Search and Knowledge Graphs

GraphRAG (Graph-based Retrieval-Augmented Generation) combines vector search with knowledge graphs to enhance natural language query handling. It retrieves both semantically similar results (using vector embeddings) and highly structured insights (via graph traversal), making it ideal for use cases like law enforcement, fraud detection, and advanced recommendation systems.

How to Implement GraphRAG with ArangoDB

  1. Store Embeddings and Relationships
    • Store vector embeddings in Document Collections (either Vertices or Edges).
    • Organize Entities and their Relationships in the Graph.
  2. Set Up Query Pipeline
    • Use Vector Search to find semantically similar items.
    • Traverse the Graph to uncover related Entities and Relationships.
  3. Combine with an LLM (Large Language Model)
    • Use the results to provide context for an LLM, enabling more accurate and context-aware responses.
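The third step can be sketched as plain string assembly: take the `{doc, score, related_nodes}` results produced by a combined vector and graph query and turn them into context for an LLM prompt. This is only one possible shape for such a step. The `description` attribute, the function names, and the prompt wording are assumptions made for illustration; `_key` and `_id` are the standard ArangoDB document attributes.

```python
def build_context(hits, max_related=3):
    """Turn vector-search hits plus their graph neighbours into prompt text."""
    lines = []
    for hit in hits:
        doc = hit["doc"]
        lines.append(
            f"Case {doc['_key']} (score {hit['score']:.2f}): {doc['description']}"
        )
        # Cap the neighbours per hit so the prompt stays reasonably small.
        for node in hit.get("related_nodes", [])[:max_related]:
            lines.append(f"  related: {node['_id']}")
    return "\n".join(lines)

def build_prompt(question, hits):
    """Combine the retrieved context with the user's question."""
    context = build_context(hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical query result in the {doc, score, related_nodes} shape.
sample_hits = [
    {
        "doc": {"_key": "A1425", "description": "Phishing campaign"},
        "score": 0.9123,
        "related_nodes": [{"_id": "transactions/10"}, {"_id": "persons/7"}],
    }
]
prompt = build_prompt("Who is linked to case A1425?", sample_hits)
print(prompt)
```

The resulting string would then be sent to the LLM of your choice; capping related nodes per hit is one simple way to keep the context within the model's input limits.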

Natural Language Querying with LangChain

To allow users to query using natural language, you can integrate LangChain with ArangoDB. LangChain converts a user’s natural language input into structured AQL queries. Here’s how you might implement this:

Step 1: Define the Workflow

  • User inputs a query like: “Find cases similar to Cybercrime A1425 and show related transactions.”
  • LangChain processes the query to understand its intent and structure.
  • The tool generates an AQL query combining vector search and graph traversal.

Step 2: Example LangChain Integration

from arango import ArangoClient
from langchain_openai import ChatOpenAI
from langchain_community.graphs import ArangoGraph
from langchain_community.chains.graph_qa.arangodb import ArangoGraphQAChain

# Initialize ArangoDB connection
client = ArangoClient("http://localhost:8529")
db = client.db(username="root", password="test")

# Select the LLM of choice
llm = ChatOpenAI(temperature=0, model_name="gpt-4")

# Define the natural language interface chain
chain = ArangoGraphQAChain.from_llm(
    llm=llm, graph=ArangoGraph(db)
)

# Invoke the chain interface
response = chain.invoke("find cases similar to Cybercrime A1425 & their related transactions")

print(response)

Explanation:

  • LangChain generates the AQL query based on the user’s input.
  • The generated query could combine vector search and graph traversal, as shown earlier.
  • The result is sent back to the user as structured insights.

Why Combine Vector Search with Graph?

By pairing vector search with graph traversal, you get the best of both worlds:

  • Vector Search: Excels at retrieving semantically similar, unstructured data.
  • Graph Traversal: Shines when exploring structured relationships.

For instance:

  • In fraud detection, vector search finds similar cases, while graph traversal uncovers linked transactions and actors.
  • In law enforcement, vector search identifies relevant documents, and graph traversal maps connections between suspects.

Conclusion

ArangoDB’s vector search, powered by FAISS, is more than a standalone feature: it’s a force multiplier for combining advanced data science techniques with graph-based insights. Whether you’re implementing natural language interfaces with LangChain or building hybrid query pipelines for real-world problems, the integration of vector search into ArangoDB’s multi-model system lets you answer similarity and relationship questions in a single query.

Get started by creating your vector index, crafting AQL queries, and exploring what’s possible when you blend vectors with graphs. The tools are ready; now it’s up to you to build something amazing.

Anthony Mahanna

Anthony is an Honours Computer Science student at the University of Ottawa, Canada. He first discovered ArangoDB’s multi-model services while working on his image repository side project. After presenting his side project in an ArangoDB Community Pioneer session, Anthony transitioned to working with the Core & ML teams as an SWE intern.
