Vector Search in ArangoDB: Practical Insights and Hands-On Examples
Estimated reading time: 6 minutes
Vector search is gaining traction as a go-to tool for handling large, unstructured datasets like text, images, and audio. It works by comparing vector embeddings, numerical representations generated by machine learning models, to find items with similar properties. With the integration of Facebook’s FAISS library, ArangoDB brings scalable, high-performance vector search directly into its core, accessible via AQL (ArangoDB Query Language). Vector search is now just another fully integrated data model in ArangoDB’s multi-model approach.
This guide will walk you through setting up vector search, combining it with graph traversal for advanced use cases, and using tools like LangChain to power natural language queries that integrate Vector Search and GraphRAG.
What is Vector Search and Why Does it Matter?
Vector search lets you find similar items in a dataset by comparing their embeddings. Embeddings are essentially compact numerical representations of data, like a fingerprint for each data point, that capture its essence in a way that machine learning models can process.
For instance:
- Text: An embedding might represent the meaning of a sentence.
- Images: An embedding might capture the general appearance of an object.
- Audio: An embedding might represent the rhythm or tone of a sound.
Traditional search methods like keyword matching or exact lookups can’t handle the subtle relationships captured by embeddings. Vector search fills that gap, finding semantically similar results even if the original data is complex or unstructured.
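To make this concrete, here is a minimal, framework-free sketch of how similarity between embeddings is computed. The three-dimensional vectors are toy values for illustration, not real model output; production embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: "cat" and "kitten" point in similar directions, "car" does not
cat    = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car    = [0.1, 0.2, 0.95]

print(cosine_similarity(cat, kitten))  # high: semantically similar
print(cosine_similarity(cat, car))     # much lower
```

This is exactly the comparison a vector index performs at scale, using approximate techniques so it does not have to score every document.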
Setting Up Vector Search in ArangoDB
To get started, you need to create a vector index, which makes searching embeddings fast and efficient.
Step 1: Create a Vector Index in AQL
Imagine you have a collection called items where each document includes a vector embedding stored in the attribute vector_data. To set up an index using arangosh:
db.items.ensureIndex({
  name: "vector_cosine",
  type: "vector",
  fields: ["vector_data"],
  params: { metric: "cosine", dimension: 128, nLists: 100 }
});
Explanation:
- type: "vector" specifies that this is a vector index.
- fields: ["vector_data"] names the attribute that holds the embeddings.
- metric: "cosine" defines the similarity measure (the alternative is "l2", i.e. Euclidean distance).
- dimension: 128 indicates the size of the embeddings.
- nLists: 100 defines the number of clusters (inverted lists) used by the index.
This step prepares the collection for efficient similarity searches.
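If you are working from Python rather than arangosh, the same index can be created through a driver. The sketch below assumes python-arango’s generic `add_index` method on a collection handle; the index definition mirrors the arangosh example above.

```python
# Vector index definition mirroring the arangosh example
VECTOR_INDEX = {
    "name": "vector_cosine",
    "type": "vector",
    "fields": ["vector_data"],
    "params": {"metric": "cosine", "dimension": 128, "nLists": 100},
}

def ensure_vector_index(db, collection_name="items"):
    """Create the vector index via a python-arango database handle (assumed API)."""
    # db.collection(...).add_index(...) is the python-arango idiom for
    # creating an index from a raw definition dict
    return db.collection(collection_name).add_index(VECTOR_INDEX)
```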
Step 2: Perform a Vector Search
Now, let’s query the items collection to find the five most similar embeddings to a query vector:
FOR doc IN items
  LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
  SORT score DESC
  LIMIT 5
  RETURN { doc, score }
Explanation:
- score: The approximate cosine similarity between the query vector and the document’s embedding. The closer the score is to 1, the more similar the document is to the query.
- APPROX_NEAR_COSINE: Compares the vector_data attribute of each document with the query vector using Approximate Nearest Neighbor (ANN) search based on cosine similarity.
- SORT: Orders the results by similarity score, descending.
- LIMIT: Restricts the results to the top 5 matches.
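In application code you would typically pass the query embedding as a bind parameter instead of inlining it. A sketch using a python-arango-style `db.aql.execute` call (the driver API is an assumption; the AQL matches the query above):

```python
# AQL for the vector search, with the embedding and limit as bind variables
SIMILARITY_AQL = """
FOR doc IN items
  LET score = APPROX_NEAR_COSINE(doc.vector_data, @query)
  SORT score DESC
  LIMIT @limit
  RETURN { doc, score }
"""

def find_similar(db, embedding, limit=5):
    """Run the vector search, passing the embedding as the @query bind variable."""
    cursor = db.aql.execute(
        SIMILARITY_AQL,
        bind_vars={"query": embedding, "limit": limit},
    )
    return list(cursor)
```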
Combining Vector Search with Graph Traversals
One of ArangoDB’s strengths is its multi-model capability, which allows you to combine vector search with graph traversal. For example, in a fraud detection scenario, you might:
- Use vector search to find similar case descriptions.
- Traverse the graph to uncover related entities (e.g., linked transactions or individuals).
Example: Vector Search + Graph Traversal
FOR doc IN items
  LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
  SORT score DESC
  LIMIT 5
  LET related_nodes = (
    FOR v, e, p IN 1..2 ANY doc GRAPH "fraud_graph"
      RETURN v
  )
  RETURN { doc, score, related_nodes }
Explanation:
- The first part of the query finds documents similar to the query vector (just like the example above).
- The LET sub-query performs a graph traversal on the results, fetching nodes related to each document.
- The final RETURN combines the document, similarity score, and its related graph data.
GraphRAG: Combining Vector Search and Knowledge Graphs
GraphRAG (Graph-based Retrieval-Augmented Generation) combines vector search with knowledge graphs to enhance natural language query handling. It retrieves both semantically similar results (using vector embeddings) and highly structured insights (via graph traversal), making it ideal for use cases like law enforcement, fraud detection, and advanced recommendation systems.
How to Implement GraphRAG with ArangoDB
- Store embeddings and relationships
  - Store vector embeddings in document collections (vertex or edge collections).
  - Organize entities and their relationships in the graph.
- Set up the query pipeline
  - Use vector search to find semantically similar items.
  - Traverse the graph to uncover related entities and relationships.
- Combine with an LLM (Large Language Model)
  - Use the results as context for an LLM, enabling more accurate, context-aware responses.
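The last step, feeding retrieval results to an LLM, mostly amounts to assembling a context prompt. A minimal sketch, where the document fields, prompt wording, and sample data are illustrative assumptions, and `hits` is shaped like the combined vector-plus-traversal query result shown earlier:

```python
def build_graphrag_prompt(question, hits):
    """Turn vector-search hits and their graph neighborhoods into LLM context.

    `hits` is assumed to be a list of dicts shaped like the AQL result above:
    {"doc": {...}, "score": float, "related_nodes": [{...}, ...]}.
    """
    lines = ["Answer the question using only the context below.", "", "Context:"]
    for hit in hits:
        doc = hit["doc"]
        lines.append(f"- Case: {doc.get('text', '')} (similarity {hit['score']:.2f})")
        for node in hit.get("related_nodes", []):
            lines.append(f"    related: {node.get('name', node.get('_id', '?'))}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

# Illustrative retrieval result (hypothetical data)
hits = [{
    "doc": {"text": "Phishing ring targeting banks"},
    "score": 0.91,
    "related_nodes": [{"name": "Transaction T-17"}],
}]
print(build_graphrag_prompt("Which transactions are involved?", hits))
```

The resulting string becomes the context portion of the LLM call, grounding the model’s answer in both the semantically similar cases and their graph neighborhoods.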
Natural Language Querying with LangChain
To allow users to query using natural language, you can integrate LangChain with ArangoDB. LangChain converts a user’s natural language input into structured AQL queries. Here’s how you might implement this:
Step 1: Define the Workflow
- User inputs a query like: “Find cases similar to Cybercrime A1425 and show related transactions.”
- LangChain processes the query to understand its intent and structure.
- The tool generates an AQL query combining vector search and graph traversal.
Step 2: Example LangChain Integration
from arango import ArangoClient
from langchain_openai import ChatOpenAI
from langchain_community.graphs import ArangoGraph
from langchain_community.chains.graph_qa.arangodb import ArangoGraphQAChain

# Initialize ArangoDB connection
client = ArangoClient("http://localhost:8529")
db = client.db(username="root", password="test")

# Select the LLM of choice
llm = ChatOpenAI(temperature=0, model_name="gpt-4")

# Define the natural language interface chain
chain = ArangoGraphQAChain.from_llm(
    llm=llm, graph=ArangoGraph(db)
)

# Invoke the chain interface
response = chain.invoke("find cases similar to Cybercrime A1425 & their related transactions")
print(response)
Explanation:
- LangChain generates the AQL query based on the user’s input.
- The generated query could combine vector search and graph traversal, as shown earlier.
- The result is sent back to the user as structured insights.
Why Combine Vector Search with Graph?
By pairing vector search with graph traversal, you get the best of both worlds:
- Vector Search: Excels at retrieving semantically similar, unstructured data.
- Graph Traversal: Shines when exploring structured relationships.
For instance:
- In fraud detection, vector search finds similar cases, while graph traversal uncovers linked transactions and actors.
- In law enforcement, vector search identifies relevant documents, and graph traversal maps connections between suspects.
HybridGraphRAG: Combining Vector Search with Graph Traversals and Full-Text Search
HybridGraphRAG extends the power of ArangoDB by combining three advanced retrieval mechanisms: vector search, graph traversal, and full-text search. This hybrid approach ensures you can handle complex, multi-dimensional queries that involve both semantic similarity and structured data relationships.
Why Use HybridGraphRAG?
When combining these technologies, you can:
- Retrieve semantically similar documents using vector search.
- Explore relationships between entities through graph traversal.
- Match specific keywords or phrases using full-text search.
This approach is ideal for applications like fraud detection, law enforcement, or personalized recommendation systems where structured and unstructured data complement each other.
How to Implement HybridGraphRAG in AQL
Let’s walk through an example where you:
- Use full-text search to find documents mentioning “cybercrime” and “threat”.
- Use vector search to retrieve documents similar to the query embedding.
- Use graph traversals to find relationships between the retrieved documents.
Combined Query Example:
LET text_matches = (
  FOR doc IN itemsView
    SEARCH PHRASE(doc.text, ["cybercrime", "threat"], "text_en")
    RETURN doc._id
)
FOR doc IN items
  FILTER doc._id IN text_matches
  LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
  SORT score DESC
  LIMIT 5
  LET related_nodes = (
    FOR v, e, p IN 1..2 ANY doc GRAPH "fraud_graph"
      RETURN v
  )
  RETURN { doc, score, related_nodes }
Explanation:
- itemsView: An ArangoSearch View providing the inverted index over doc.text.
- SEARCH PHRASE(…): Matches documents where the tokens “cybercrime” and “threat” appear as a consecutive phrase.
- APPROX_NEAR_COSINE: Compares each doc.vector_data with the query vector using Approximate Nearest Neighbor search based on cosine similarity.
- LET related_nodes: Fetches the 1-to-2-hop neighborhood of each matching document.
Conclusion
ArangoDB’s vector search, powered by FAISS, is more than a standalone feature: it’s a force multiplier for combining advanced data science techniques with graph-based insights. Whether you’re implementing natural language interfaces with LangChain or building hybrid query pipelines for real-world problems, the integration of vector search into ArangoDB’s multi-model system opens up endless possibilities.
Get started by creating your vector index, crafting AQL queries, and exploring what’s possible when you blend vectors with graphs. The tools are ready; now it’s up to you to build something amazing.