
Vector Search in ArangoDB: Practical Insights and Hands-On Examples

Estimated reading time: 6 minutes

Vector search is gaining traction as a go-to tool for handling large, unstructured datasets like text, images, and audio. It works by comparing vector embeddings, numerical representations generated by machine learning models, to find items with similar properties. With the integration of Facebook’s FAISS library, ArangoDB brings scalable, high-performance vector search directly into its core, accessible via AQL (ArangoDB Query Language). Vector Search is now just another, fully-integrated data type/model in ArangoDB’s multi-model approach.

This guide will walk you through setting up vector search, combining it with graph traversal for advanced use cases, and using tools like LangChain to power natural language queries that integrate Vector Search and GraphRAG.

What is Vector Search and Why Does it Matter?

Vector search lets you find similar items in a dataset by comparing their embeddings. Embeddings are compact numerical representations of data, like a fingerprint for each data point, that capture its essence in a form machine learning models can process.

For instance:

  • Text: An embedding might represent the meaning of a sentence.
  • Images: An embedding might capture the general appearance of an object.
  • Audio: An embedding might represent the rhythm or tone of a sound.

Traditional search methods like keyword matching or exact lookups can’t handle the subtle relationships captured by embeddings. Vector search fills that gap, finding semantically similar results even if the original data is complex or unstructured.
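To make the idea concrete, here is a minimal sketch of exact (brute-force) similarity search in plain Python. It is not how ArangoDB implements vector search; the three-dimensional embeddings and the names cosine_similarity and top_k are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, embeddings, k):
    """Score every stored embedding against the query and keep the k best."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (hypothetical values; real ones come from an ML model)
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

for key, score in top_k([1.0, 0.1, 0.0], embeddings, k=2):
    print(key, round(score, 3))
```

Scanning every vector like this gets slow on large collections, which is why a dedicated vector index is worth setting up.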

Setting Up Vector Search in ArangoDB

To get started, you need to create a vector index, which makes searching embeddings fast and efficient.

Step 1: Create a Vector Index in arangosh

Imagine you have a collection called items where each document includes a vector embedding stored in the attribute vector_data. To set up an index using arangosh:

db.items.ensureIndex({
        name: "vector_cosine",
        type: "vector",
        fields: ["vector_data"],
        params: { metric: "cosine", dimension: 128, nLists: 100 }
});

Explanation:

  • type: “vector” specifies this is a vector index.
  • dimension: 128 indicates the size of the embeddings.
  • metric: “cosine” defines the similarity measurement (l2 is another option).
  • nLists: 100 defines the number of clusters used in the index.

This step prepares the collection for efficient similarity searches.

Now, let’s query the items collection to find the five most similar embeddings to a query vector:

LET query = [0.1, 0.3, 0.5, …]
FOR doc IN items
    LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
    SORT score DESC
    LIMIT 5
    RETURN {doc, score}   

Explanation:

  • score: The approximate cosine similarity between the query vector and the document’s vector, a number between -1 and 1. The closer the score is to 1, the more similar the document is to the query.
  • APPROX_NEAR_COSINE: Compares the vector_data attribute of each document with the query vector using approximate nearest neighbor search based on cosine similarity.
  • SORT: Orders the results by similarity score, descending.
  • LIMIT: Restricts the results to the top 5 matches.
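In an application you would usually pass the query vector as a bind parameter rather than writing it into the query text. The sketch below assumes the python-arango driver, whose db.aql.execute accepts bind_vars; the helper name find_similar and the connection details in the comments are placeholders.

```python
# The same query as above, with the vector and the limit as bind parameters
AQL_TOP_K = """
FOR doc IN items
    LET score = APPROX_NEAR_COSINE(doc.vector_data, @query)
    SORT score DESC
    LIMIT @k
    RETURN {doc, score}
"""

def find_similar(db, query_vector, k=5):
    """Run the top-k similarity query on a python-arango database handle."""
    cursor = db.aql.execute(
        AQL_TOP_K,
        bind_vars={"query": query_vector, "k": k},
    )
    return list(cursor)

# Usage against a running server (placeholder credentials):
#   from arango import ArangoClient
#   db = ArangoClient(hosts="http://localhost:8529").db(
#       "_system", username="root", password="test")
#   results = find_similar(db, my_query_vector)
```

The query vector must have the same number of dimensions as the index (128 in this example).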

    Combining Vector Search with Graph Traversals

    One of ArangoDB’s strengths is its multi-model capability, which allows you to combine vector search with graph traversal. For example, in a fraud detection scenario, you might:

    1. Use vector search to find similar case descriptions.
    2. Traverse the graph to uncover related entities (e.g., linked transactions or individuals).

    Example: Vector Search + Graph Traversal

    LET query = [0.1, 0.3, 0.5, …]
    FOR doc IN items
          LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
          SORT score DESC
          LIMIT 5
          LET related_nodes = (
              FOR v, e, p IN 1..2 ANY doc GRAPH 'fraud_graph'
                RETURN v
          )
          RETURN {doc, score, related_nodes}

    Explanation:

    • The outer FOR loop finds the five documents most similar to the query vector (just like the example above).
    • The LET sub-query performs a graph traversal on the results, fetching nodes related to each document.
    • The final RETURN combines the document, similarity score, and its related graph data.
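To see what the 1..2 ANY traversal contributes, here is a rough in-memory analogue in plain Python: a breadth-first search that collects every node within two hops of a start node, ignoring edge direction. The edge list and the function name neighbors_within are invented for illustration. Unlike this sketch, an AQL traversal emits one vertex per path, so it may return the same vertex more than once.

```python
from collections import deque

def neighbors_within(graph, start, max_depth=2):
    """Return nodes reachable from start in 1..max_depth hops (each node once)."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found

# Hypothetical edges: a case linked to a transaction, linked to a person, and so on
edges = [("case/1", "tx/10"), ("tx/10", "person/7"), ("person/7", "tx/11")]

# Store each edge in both directions to mimic the ANY direction
graph = {}
for a, b in edges:
    graph.setdefault(a, []).append(b)
    graph.setdefault(b, []).append(a)

print(neighbors_within(graph, "case/1"))
```

Here tx/11 is three hops away from case/1, so it falls outside the 1..2 range.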

      GraphRAG: Combining Vector Search and Knowledge Graphs

      GraphRAG (Graph-based Retrieval-Augmented Generation) combines vector search with knowledge graphs to enhance natural language query handling. It retrieves both semantically similar results (using vector embeddings) and highly structured insights (via graph traversal), making it ideal for use cases like law enforcement, fraud detection, and advanced recommendation systems.

      How to Implement GraphRAG with ArangoDB

      1. Store Embeddings and Relationships
        • Store vector embeddings in Document Collections (either Vertices or Edges).
        • Organize Entities and their Relationships in the Graph.
      2. Set Up Query Pipeline
        • Use Vector Search to find semantically similar items.
        • Traverse the Graph to uncover related Entities and Relationships.
      3. Combine with an LLM (Large Language Model)
        • Use the results to provide context for an LLM, enabling more accurate and context-aware responses.
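The third step can be as simple as formatting the query results into a prompt. The sketch below is one possible way to do it, assuming each result has the shape {doc, score, related_nodes} used in the queries in this post and that documents carry a text attribute; the function name build_prompt and the prompt wording are illustrative choices.

```python
def build_prompt(question, results):
    """Turn vector + graph results into a plain-text prompt for an LLM."""
    lines = ["Answer the question using only the context below.", "", "Context:"]
    for i, item in enumerate(results, start=1):
        doc = item["doc"]
        lines.append(f"{i}. {doc['text']} (similarity {item['score']:.2f})")
        # List the graph neighbors so the LLM can see how entities are linked
        for node in item["related_nodes"]:
            lines.append(f"   - related: {node['_id']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

# Hypothetical query result
results = [
    {
        "doc": {"text": "Phishing campaign targeting bank customers"},
        "score": 0.91,
        "related_nodes": [{"_id": "transactions/10"}, {"_id": "persons/7"}],
    }
]

print(build_prompt("Which transactions are linked to similar cases?", results))
```

The resulting string can then be sent to the LLM of your choice as the user or system message.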

      Natural Language Querying with LangChain

      To allow users to query using natural language, you can integrate LangChain with ArangoDB. LangChain passes the user’s natural language input to an LLM, which translates it into an AQL query. Here’s how you might implement this:

      Step 1: Define the Workflow

      • User inputs a query like: “Find cases similar to Cybercrime A1425 and show related transactions.”
      • LangChain processes the query to understand its intent and structure.
      • The tool generates an AQL query combining vector search and graph traversal.

      Step 2: Example LangChain Integration

      from arango import ArangoClient
      from langchain_openai import ChatOpenAI
      from langchain_community.graphs import ArangoGraph
      from langchain_community.chains.graph_qa.arangodb import ArangoGraphQAChain

      # Initialize the ArangoDB connection (uses the default _system database)
      client = ArangoClient("http://localhost:8529")
      db = client.db(username="root", password="test")

      # Select the LLM of choice
      llm = ChatOpenAI(temperature=0, model_name="gpt-4")

      # Define the natural language interface chain.
      # Note: recent langchain_community releases may also require
      # allow_dangerous_requests=True here, since the chain runs generated AQL.
      chain = ArangoGraphQAChain.from_llm(
          llm=llm, graph=ArangoGraph(db)
      )

      # Invoke the chain interface
      response = chain.invoke("find cases similar to Cybercrime A1425 & their related transactions")

      print(response)

      Explanation:

      • LangChain generates the AQL query based on the user’s input.
      • The generated query could combine vector search and graph traversal, as shown earlier.
      • The result is sent back to the user as structured insights.

      Why Combine Vector Search with Graph?

      By pairing vector search with graph traversal, you get the best of both worlds:

      • Vector Search: Excels at retrieving semantically similar, unstructured data.
      • Graph Traversal: Shines when exploring structured relationships.

      For instance:

      • In fraud detection, vector search finds similar cases, while graph traversal uncovers linked transactions and actors.
      • In law enforcement, vector search identifies relevant documents, and graph traversal maps connections between suspects.

      HybridGraphRAG extends the power of ArangoDB by combining three advanced retrieval mechanisms: vector search, graph traversal, and full-text search. This hybrid approach ensures you can handle complex, multi-dimensional queries that involve both semantic similarity and structured data relationships.

      Why Use HybridGraphRAG?

      When combining these technologies, you can:

      • Retrieve semantically similar documents using vector search.
      • Explore relationships between entities through graph traversal.
      • Match specific keywords or phrases using full-text search.

      This approach is ideal for applications like fraud detection, law enforcement, or personalized recommendation systems where structured and unstructured data complement each other.

      How to Implement HybridGraphRAG in AQL

      Let’s walk through an example where you:

      1. Use full-text search to find documents mentioning “cybercrime” and “threat”.
      2. Use vector search to retrieve documents similar to the query embedding.
      3. Use graph traversals to find relationships between the retrieved documents.

      Combined Query Example:

      LET query = [0.1, 0.3, 0.5, …]
      LET text_matches = (
          FOR doc IN itemsView
              SEARCH PHRASE(doc.text, ["cybercrime", "threat"], "text_en")
              RETURN doc._id
      )

      FOR doc IN items
          FILTER doc._id IN text_matches

          LET score = APPROX_NEAR_COSINE(doc.vector_data, query)
          SORT score DESC
          LIMIT 5

          LET related_nodes = (
              FOR v, e, p IN 1..2 ANY doc GRAPH 'fraud_graph'
                  RETURN v
          )

          RETURN {doc, score, related_nodes}

      Explanation:

      • itemsView: An ArangoDB View that provides the inverted index over doc.text.
      • SEARCH PHRASE(…): Matches documents whose text contains “cybercrime” directly followed by “threat”, tokenized with the text_en Analyzer.
      • APPROX_NEAR_COSINE: Compares each doc.vector_data with the query vector using approximate nearest neighbor search based on cosine similarity.
      • LET related_nodes: Fetches the 1-to-2-hop neighborhood of the matching documents.

      Conclusion

      ArangoDB’s vector search, powered by FAISS, is more than a standalone feature: it’s a force multiplier for combining advanced data science techniques with graph-based insights. Whether you’re implementing natural language interfaces with LangChain or building hybrid query pipelines for real-world problems, the integration of vector search into ArangoDB’s multi-model system lets you combine semantic similarity, graph traversal, and full-text search in a single query.

      Get started by creating your vector index, crafting AQL queries, and exploring what’s possible when you blend vectors with graphs. The tools are ready; now it’s up to you to build something amazing.


      Anthony Mahanna

      Anthony is an Honours Computer Science student at the University of Ottawa, Canada. He first discovered ArangoDB’s multi-model services while working on his image repository side project. After presenting his side project in an ArangoDB Community Pioneer session, Anthony transitioned to working with the Core & ML teams as an SWE intern.
