
Generate a Video Knowledge Graph: NVIDIA VSS Blueprint with GraphRAG on ArangoDB

NVIDIA Blog:
How to Integrate Computer Vision Pipelines with Generative AI and Reasoning

 

Arango’s Role in the NVIDIA Use Case

  • Stores the knowledge graph: ArangoDB is used to save the graph that’s built from video captions.
  • Enables fast graph reasoning: With GPU acceleration (CUDA/cuGraph), ArangoDB can quickly traverse relationships and surface patterns across video data.
  • Supports advanced Q&A: When a user asks a question, an AI agent can traverse the graph in ArangoDB to find answers—even across multiple cameras.
  • Designed for scale: This setup is especially useful in large, high-throughput environments where many AI models are running at once.

 

Video Analytics AI Agents: An Entirely New Class of Applications. Unlock knowledge and insights from camera streams and archived videos

The NVIDIA Blueprint for video search and summarization (VSS) provides a sample architecture for developing visually perceptive and interactive visual AI agents for video analytics. The VSS Blueprint from Metropolis combines generative AI, VLMs, LLMs, RAG, and media management services. These AI agents can be deployed throughout factories, warehouses, retail stores, airports, traffic intersections, and more - helping streamline operations. The VSS 2.4 release makes it easy to enhance vision AI applications with generative AI through a VLM, enabling powerful new features for smart infrastructure.

VSS 2.4 introduces a major upgrade for long‑form video analytics: GraphRAG on ArangoDB. This release brings video‑first Knowledge Graph generation, hybrid retrieval that combines vector search, full-text search, and graph traversals, and multi‑stream ingestion.

For readers new to VSS, see the broader architecture, APIs, and features like multi‑live stream, burst ingestion, audio transcription, and CV metadata in the earlier post: Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization.

Figure 1: VSS Dataflow with ArangoDB, spanning Data Sourcing, Content Preparation, Content Processing, KG Generation, Data Storage, KG Retrieval, and Reporting

 

GraphRAG for Video Analytics

Vision language models (VLMs) made broad perception possible, but long videos can strain context windows and dilute relevance. VSS 2.4 addresses this by combining:

  • Semantic ranking of chunks and entities (vector search),
  • Structured expansion over a Knowledge Graph (relationship‑aware traversal),
  • Temporal stitching for coherent narratives.

The result is a grounded system that cites the who/what/where/when, maintains temporal continuity, and scales to multi‑stream deployments.
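As a toy illustration of why these pieces complement each other (all data below is made up, and none of this is VSS code): pure semantic ranking can drop a low-similarity chunk that connects two highly relevant moments, while the time chain and entity edges pull it back into the evidence set.

```python
# Toy illustration only - not VSS code. In VSS 2.4 the chunks, entities,
# and edges live in ArangoDB; here they are tiny in-memory dictionaries.

chunks = {
    "c1": {"t": (0, 12),  "text": "Rust on the lower beam near pier 2.",        "sim": 0.91},
    "c2": {"t": (12, 24), "text": "Crack propagating along the concrete deck.",  "sim": 0.42},
    "c3": {"t": (24, 36), "text": "Corrosion spreading to the expansion joint.", "sim": 0.88},
}
has_entity = {"c1": ["support beam"], "c2": ["concrete deck"], "c3": ["expansion joint"]}
next_chunk = {"c1": "c2", "c2": "c3"}

# 1. Semantic ranking alone keeps c1 and c3 but drops the connecting chunk c2 ...
ranked = sorted(chunks, key=lambda c: chunks[c]["sim"], reverse=True)[:2]

# 2+3. ... while graph expansion and temporal stitching pull it back in,
# so the packed evidence reads as one coherent narrative.
evidence = set(ranked)
for c in ranked:
    if next_chunk.get(c) is not None:
        evidence.add(next_chunk[c])

for c in sorted(evidence, key=lambda c: chunks[c]["t"][0]):   # order by start time
    print(chunks[c]["t"], chunks[c]["text"], has_entity[c])
```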

 

What’s new in VSS 2.4 for Graphs

  • Support for ArangoDB, a multi-model database tuned for video intelligence.
  • A video‑first graph schema modeling time, hierarchy, and entity relations.
  • Hybrid retrieval that merges semantic & lexical similarity with hop‑limited graph traversal.
  • Multi‑stream ingestion that preserves per‑stream metadata for flexible analytics.

 

What your Knowledge Graph can now capture

VSS 2.4 converts video outputs and metadata into a Knowledge Graph, expanding beyond plain text to improve retrieval precision and explainability.

  • Core data points and attributes
    • Documents: Logical containers for videos/sessions.
    • Chunks: Timestamped segments with captions/transcripts and embeddings.
    • Entities: Named people, equipment, locations, and concepts with types/descriptions extracted from Chunks.
    • Communities: Batch‑level or thematic summaries of sequential chunks for macro reasoning.
  • Temporal structure
    • Start/end times per chunk for precise windows.
    • CHUNK → NEXT_CHUNK → CHUNK edges to stitch adjacent moments into a narrative.
  • Relational context
    • CHUNK → HAS_ENTITY → ENTITY edges to ground mentions in each chunk.
    • ENTITY → LINKS_TO → ENTITY edges to connect entities with typed relations.
    • CHUNK → PART_OF → DOCUMENT edges to bind chunks to documents.
    • IN_SUMMARY/SUMMARY_OF edges to align summaries with their evidence.
  • Operational metadata
    • Stream IDs and Camera IDs for multi-stream analytics, scoping, and auditability.
    • Asset references (e.g., frame directories) to trace evidence.
  • Multimodal signals
    • Dense captions and (optional) audio transcripts for complementary cues.
    • Embeddings on chunks, entities, and summaries to enable semantic search and clustering.
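To make this concrete, the sketch below shows how a chunk, an entity, and a grounding edge might look as ArangoDB documents. The collection and field names are illustrative assumptions for this post, not necessarily the exact schema emitted by VSS 2.4.

```python
# Illustrative ArangoDB documents for the schema described above.
# Collection names ("chunks", "entities") and field names are assumptions
# for this sketch, not necessarily the exact VSS 2.4 schema.

chunk = {
    "_key": "chunk_0042",
    "document": "documents/bridge_inspection_01",  # parent video/session
    "stream_id": "stream_01",                      # operational metadata
    "camera_id": "cam_north_span",
    "start_time": 84.0,                            # seconds from video start
    "end_time": 96.0,
    "caption": "Heavy rust is visible on the lower support beam near pier 2.",
    "embedding": [0.012, -0.087, 0.154],           # truncated; real vectors are high-dimensional
}

entity = {
    "_key": "entity_lower_support_beam",
    "name": "lower support beam",
    "type": "STRUCTURE",
    "description": "Steel support beam near pier 2 showing surface corrosion.",
    "embedding": [0.033, 0.145, -0.072],
}

# CHUNK -> HAS_ENTITY -> ENTITY edge grounding the mention in its chunk.
has_entity_edge = {
    "_from": "chunks/chunk_0042",
    "_to": "entities/entity_lower_support_beam",
}
```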

For instance, given a video of a bridge captured for structural inspection (refer to the View Examples section here), these facets let the system answer questions like the ones below, citing timestamps, cameras, and entities for transparency.

  • What structural issues are visible across the video and which areas are most affected?
  • Are there any immediate safety risks based on the visible condition of the bridge’s metal and concrete components?
  • How does the level of rust and corrosion change throughout the video, and what sections require urgent maintenance?
  • Does the surrounding environment appear to be impacting the bridge’s structural integrity?
  • Is the bridge overall stable and usable, or does it show signs of potential failure without intervention?

Figure 2: Frames of the VSS Bridge Video and their corresponding Chunks stored in ArangoDB

 

Breakdown: How to Ingest Video Data

1. Segmentation and metadata (Figure 2): Split long videos into timestamped chunks; attach session/stream/camera IDs, offsets, and asset references.

2. Entity/relationship extraction (Figures 3 & 4): Identify entities (people, equipment, places, concepts) and typed relations; bind entities to the chunks that mention them. If the user specifies custom entity and relationship types, restrict extraction to those.

3. Temporal + hierarchy links (Figure 3): Connect the first chunk to its parent document; link adjacent chunks to form a time chain; keep per-chunk provenance.

4. Communities/summarization (Figure 5): Create higher-level community summaries of chunks; link supporting chunks to summaries and summaries back to documents.

5. Graph persistence (Figures 6 & 7): Store chunks, entities, documents, and communities with typed edges (has-entity, links-to, part-of, next-chunk, in-summary, summary-of, etc.); a minimal persistence sketch follows this list.

6. Embeddings + vector indexing: Embed chunks/entities/summaries; build cosine-based vector indices sized to corpus and embedding dimension; optionally enable hybrid (keyword + vector).

7. Hygiene: Normalize entities, reduce duplicates, resolve similar triplets.
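As a rough sketch of step 5, the snippet below creates the relevant collections and persists one chunk together with its document and time-chain edges using the python-arango driver. The database name, credentials, and collection names are placeholder assumptions; the VSS pipeline creates and manages its own collections.

```python
# Minimal persistence sketch (step 5 above) using python-arango.
# Database name, credentials, and collection names are placeholder
# assumptions, not the names used by the VSS pipeline itself.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("vss", username="root", password="openSesame")

# Vertex collections for documents/chunks/entities, edge collections for relations.
for name in ("documents", "chunks", "entities"):
    if not db.has_collection(name):
        db.create_collection(name)
for name in ("part_of", "next_chunk", "has_entity", "links_to"):
    if not db.has_collection(name):
        db.create_collection(name, edge=True)

db.collection("documents").insert({"_key": "bridge_inspection_01", "title": "Bridge inspection video"})
db.collection("chunks").insert({
    "_key": "chunk_0042",
    "start_time": 84.0,
    "end_time": 96.0,
    "caption": "Heavy rust is visible on the lower support beam near pier 2.",
    "embedding": [0.012, -0.087, 0.154],   # truncated for readability
})

# Bind the chunk to its document and chain it to the previous chunk.
db.collection("part_of").insert({"_from": "chunks/chunk_0042", "_to": "documents/bridge_inspection_01"})
db.collection("next_chunk").insert({"_from": "chunks/chunk_0041", "_to": "chunks/chunk_0042"})
```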

Figure 3: Generating a Knowledge Graph from Video Chunks (green) with Entities (yellow), Communities (magenta), and Documents (blue)

 

Figure 4: Mapping Chunks (green) to their source Document (blue) and Communities (magenta)

 

Figure 5: Mapping Chunks (green) to their Community Summaries (magenta)

 

Figure 6: Visualizing a sample VSS Knowledge Graph stored in ArangoDB

 

Figure 7: Sample Entities & Relationships extracted by the LLM through the VSS Pipeline, stored as ArangoDB documents

 

Breakdown: How to Retrieve Data 

1. Select profile: Chunk-centric (time-localized), entity-centric (who/what/where), or GNN-ready (structured graph payload).

2. Embed: Convert the question into the same vector space as chunks/entities.

3. Rank (vector): Select top‑K candidates by cosine similarity; optionally combine with keyword scoring for terms and names (a retrieval sketch follows this list).

4. Expand (graph): Add nearby evidence with limited-hop traversal:

  • From chunks to mentioned entities (has-entity).
  • Between entities via typed relations (links-to).
  • To summaries for macro context, sibling chunks, and provenance.

5. Stitch (temporal): Pull pre/post neighbors along the time chain for coherent narratives; apply time/camera filters as needed.

6. Pack context: Deduplicate and order evidence by score/time; include text snippets, entities/relations, timestamps, and stream/camera IDs.

7. Output formats: Text-centric context for summarization/Q&A, or a GNN-ready graph (nodes, relation types, edge indices, descriptions).

8. Tuning: Adjust top‑K, hop radius (typically 0–2), chunk size, and filters to balance recall, latency, and specificity.
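Under the same placeholder naming as the ingestion sketch, the snippet below illustrates steps 2-5: rank chunks by cosine similarity to the question embedding with AQL's COSINE_SIMILARITY, then expand each hit with a hop-limited traversal over entity edges and the time chain. This is a sketch of the retrieval pattern, not the VSS retrieval code itself; in a real deployment the vector indices built during ingestion would accelerate the ranking stage.

```python
# Hybrid retrieval sketch (steps 2-5 above), using the same placeholder
# database/collection names as the ingestion sketch; not VSS retrieval code.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("vss", username="root", password="openSesame")

# Embedding of the user question, produced by the same model as the chunk embeddings.
question_embedding = [0.021, -0.044, 0.108]   # toy dimensionality

# 2-3. Embed + rank: score chunks by cosine similarity and keep the top-K.
top_chunks = list(db.aql.execute(
    """
    FOR c IN chunks
      LET score = COSINE_SIMILARITY(c.embedding, @q)
      SORT score DESC
      LIMIT @k
      RETURN MERGE(c, { score: score })
    """,
    bind_vars={"q": question_embedding, "k": 5},
))

# 4-5. Expand + stitch: hop-limited traversal over entity edges and the time chain.
for chunk in top_chunks:
    neighbors = list(db.aql.execute(
        """
        FOR v, e IN 1..2 ANY @start has_entity, links_to, next_chunk
          RETURN { vertex: v, via: PARSE_IDENTIFIER(e._id).collection }
        """,
        bind_vars={"start": chunk["_id"]},
    ))
    # 6. Pack context: order by score/time, dedupe, attach timestamps and stream/camera IDs.
    print(chunk["_key"], chunk["score"], len(neighbors))
```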

 

How this fits the broader VSS updates

The original VSS post introduced GA features such as multi‑live stream, burst mode ingestion, a customizable CV pipeline, and audio transcription. These capabilities feed the GraphRAG pipeline so the agent can:

  • Fuse visual information and audio transcriptions to improve precision,
  • Use object/tracking metadata to clarify which entities are involved,
  • Maintain per‑stream separation while supporting cross‑stream queries.

Together, these enable the temporal reasoning, multi‑hop reasoning, anomaly awareness, and scalability discussed in the CA‑RAG section of the original post, but now reinforced by a robust Knowledge Graph.

Figure 8: The NVIDIA AI Blueprint for Video Search and Summarization architecture

 

Get started


Anthony Mahanna

Anthony Mahanna is a software engineer & technical lead for Arango’s GenAI Data Platform, where he applies Graph Analytics, GraphML, and GraphRAG to solve graph-driven AI problems. Anthony joined Arango full-time in July 2023 after previously interning while attending university. He holds a B.Sc (Hons) in Computer Science from the University of Ottawa, Canada.
