
Generate a Video Knowledge Graph: NVIDIA VSS Blueprint with GraphRAG on ArangoDB
NVIDIA Blog: How to Integrate Computer Vision Pipelines with Generative AI and Reasoning
ArangoDB's Role in the NVIDIA Use Case
- Stores the knowledge graph: ArangoDB is used to save the graph that’s built from video captions.
- Enables fast graph reasoning: With GPU acceleration (CUDA/cuGraph), ArangoDB makes it quick to connect relationships and find patterns across video data.
- Supports advanced Q&A: When a user asks a question, an AI agent can traverse the graph in ArangoDB to find answers—even across multiple cameras.
- Designed for scale: This setup is especially useful in large, high-throughput environments where many AI models are running at once.
The NVIDIA AI Blueprint for video search and summarization (VSS) provides a sample architecture for developing visually perceptive, interactive visual AI agents for video analytics. The VSS Blueprint, part of NVIDIA Metropolis, combines generative AI, VLMs, LLMs, RAG, and media management services. These AI agents can be deployed throughout factories, warehouses, retail stores, airports, traffic intersections, and more, helping streamline operations. The VSS 2.4 release makes it easy to enhance vision AI applications with generative AI through a VLM, enabling powerful new features for smart infrastructure.
VSS 2.4 introduces a major upgrade for long‑form video analytics: GraphRAG on ArangoDB. This release brings video‑first Knowledge Graph generation, hybrid retrieval combining vector search, full-text search and graph traversals, and multi‑stream ingestion.
For readers new to VSS, see the broader architecture, APIs, and features like multi‑live stream, burst ingestion, audio transcription, and CV metadata in the earlier post: Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization.
Figure 1: VSS Dataflow with ArangoDB
GraphRAG for Video Analytics
Vision language models (VLMs) made broad perception possible, but long videos can strain context windows and dilute relevance. VSS 2.4 addresses this by combining:
- Semantic ranking of chunks and entities (vector search),
- Structured expansion over a Knowledge Graph (relationship‑aware traversal),
- Temporal stitching for coherent narratives.
This provides grounded systems that cite the who/what/where/when, maintain temporal continuity, and scale to multi‑stream deployments.
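To make the context-window point concrete, here is a minimal, illustrative sketch (not VSS source code) of splitting a long recording into overlapping, timestamped chunks so that each VLM captioning call only ever sees a bounded segment; the chunk length and overlap values are arbitrary placeholders.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    stream_id: str
    index: int
    start_s: float  # chunk start offset in seconds
    end_s: float    # chunk end offset in seconds

def split_video(stream_id: str, duration_s: float,
                chunk_s: float = 60.0, overlap_s: float = 5.0) -> list[Chunk]:
    """Split a long video into overlapping, timestamped chunks.

    Each chunk is captioned independently by the VLM, so no single call
    has to fit the entire video into its context window.
    """
    chunks, start, i = [], 0.0, 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append(Chunk(stream_id, i, start, end))
        start = end if end >= duration_s else end - overlap_s
        i += 1
    return chunks

# A one-hour stream becomes overlapping one-minute chunks with precise windows.
print(len(split_video("camera-01", 3600.0)))
```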
What’s new in VSS 2.4 for Graphs
- ArangoDB support: a multi-model database tuned for video intelligence.
- A video‑first graph schema modeling time, hierarchy, and entity relations.
- Hybrid retrieval that merges semantic & lexical similarity with hop‑limited graph traversal.
- Multi‑stream ingestion that preserves per‑stream metadata for flexible analytics.
What your Knowledge Graph can now capture
VSS 2.4 converts video outputs and metadata into a Knowledge Graph, expanding beyond plain text to improve retrieval precision and explainability. A sketch of how this schema maps onto ArangoDB collections follows the list below.
- Core data points and attributes
- Documents: Logical containers for videos/sessions.
- Chunks: Timestamped segments with captions/transcripts and embeddings.
- Entities: Named people, equipment, locations, and concepts with types/descriptions extracted from Chunks.
- Communities: Batch‑level or thematic summaries of sequential chunks for macro reasoning.
- Temporal structure
- Start/end times per chunk for precise windows.
- CHUNK → NEXT_CHUNK → CHUNK edges to stitch adjacent moments into a narrative.
- Relational context
- CHUNK → HAS_ENTITY → ENTITY edges to ground mentions in each chunk.
- ENTITY → LINKS_TO → ENTITY edges to connect entities with typed relations.
- CHUNK → PART_OF → DOCUMENT edges to bind chunks to documents.
- IN_SUMMARY/SUMMARY_OF edges to align summaries with their evidence.
- Operational metadata
- Stream IDs and camera IDs for multi-stream analytics, scoping, and auditability.
- Asset references (e.g., frame directories) to trace evidence.
- Multimodal signals
- Dense captions and (optional) audio transcripts for complementary cues.
- Embeddings on chunks, entities, and summaries to enable semantic search and clustering.
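As a rough sketch of how this schema could be laid out in ArangoDB, the snippet below uses the python-arango driver to create one vertex collection per node type and one edge collection per relation. The graph and collection names are illustrative assumptions, not necessarily the names the Blueprint uses internally.

```python
from arango import ArangoClient

# Connect to a local ArangoDB instance (credentials are placeholders).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("vss", username="root", password="openSesame")

# One vertex collection per node type, one edge collection per relation.
if not db.has_graph("video_kg"):
    graph = db.create_graph("video_kg")
    for vc in ("documents", "chunks", "entities", "communities"):
        graph.create_vertex_collection(vc)

    edge_defs = [
        ("part_of",    ["chunks"],      ["documents"]),    # CHUNK -> DOCUMENT
        ("next_chunk", ["chunks"],      ["chunks"]),       # temporal chain
        ("has_entity", ["chunks"],      ["entities"]),     # grounded mentions
        ("links_to",   ["entities"],    ["entities"]),     # typed relations
        ("in_summary", ["chunks"],      ["communities"]),  # evidence -> summary
        ("summary_of", ["communities"], ["documents"]),    # summary -> document
    ]
    for name, frm, to in edge_defs:
        graph.create_edge_definition(
            edge_collection=name,
            from_vertex_collections=frm,
            to_vertex_collections=to,
        )
```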
For instance, given a video of a bridge captured for structural inspection (refer to the View Examples section here), these facets allow the system to answer questions like the ones below, with timestamps, cameras, and entities referenced for transparency.
- What structural issues are visible across the video and which areas are most affected?
- Are there any immediate safety risks based on the visible condition of the bridge’s metal and concrete components?
- How does the level of rust and corrosion change throughout the video, and what sections require urgent maintenance?
- Does the surrounding environment appear to be impacting the bridge’s structural integrity?
- Is the bridge overall stable and usable, or does it show signs of potential failure without intervention?
Figure 2: Video Chunks stored in ArangoDB
Breakdown: How to Ingest Video Data
1. Segmentation and metadata (Figure 2): Split long videos into timestamped chunks; attach session/stream/camera IDs, offsets, and asset references.
2. Entity/relationship extraction (Figure 3 & 4): Identify entities (people, equipment, places, concepts) and typed relations; bind entities to the chunks that mention them. If the user specifies custom entity and relationship types, restrict extraction to those.
3. Temporal + hierarchy links (Figure 3): Connect the first chunk to its parent document; link adjacent chunks to form a time chain; keep per-chunk provenance.
4. Communities/summarization (Figure 5): Create higher-level community summaries of chunks; link supporting chunks to summaries and summaries back to documents.
5. Graph persistence (Figure 6 & 7): Store chunks, entities, documents, and communities with typed edges (has-entity, links-to, part-of, next-chunk, in-summary, summary-of, etc.); a minimal persistence sketch follows this list.
6. Embeddings + vector indexing: Embed chunks/entities/summaries; build cosine-based vector indices sized to corpus and embedding dimension; optionally enable hybrid (keyword + vector).
7. Hygiene: Normalize entities, reduce duplicates, resolve similar triplets.
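Continuing with the hypothetical collection names from the schema sketch above, the snippet below shows what steps 3, 5, and 6 could look like with python-arango: persisting captioned chunks with per-stream metadata and edges, then adding a cosine vector index. Here `captioned_segments` and `embed()` stand in for the VLM captioning and embedding stages, and the vector index assumes ArangoDB 3.12.4 or later with the vector-index feature enabled (created via the driver's generic add_index call).

```python
# Illustrative ingestion sketch, reusing `db` from the schema snippet above.
chunks = db.collection("chunks")
part_of = db.collection("part_of")
next_chunk = db.collection("next_chunk")

prev_id = None
for i, (start_s, end_s, caption) in enumerate(captioned_segments):  # from the VLM
    meta = chunks.insert({
        "_key": f"stream01_chunk_{i}",
        "stream_id": "stream-01",          # operational metadata for scoping
        "camera_id": "cam-07",
        "start_s": start_s,                # precise temporal window
        "end_s": end_s,
        "caption": caption,
        "frame_dir": f"/assets/stream-01/chunk-{i}",  # asset reference
        "embedding": embed(caption),       # step 6: chunk embedding
    })
    part_of.insert({"_from": meta["_id"], "_to": "documents/stream01"})
    if prev_id:                            # step 3: temporal chain
        next_chunk.insert({"_from": prev_id, "_to": meta["_id"]})
    prev_id = meta["_id"]

# Step 6: cosine vector index over chunk embeddings (ArangoDB 3.12.4+;
# the dimension must match the embedding model in use).
chunks.add_index({
    "type": "vector",
    "fields": ["embedding"],
    "params": {"metric": "cosine", "dimension": 1024, "nLists": 100},
})
```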
Figure 3: Generating a Knowledge Graph from Video Chunks (green) with Entities (yellow), Communities (magenta), and Documents (blue)
Figure 4: Mapping Chunks (green) to their source Document (blue) and Communities (magenta)
Figure 5: Mapping Chunks (green) to their Community Summaries (magenta)
Figure 6: Visualizing a sample VSS Graph
Figure 7: Sample Entities & Relationships
Breakdown: How to Retrieve Data
1. Select profile: Chunk-centric (time-localized), entity-centric (who/what/where), or GNN-ready (structured graph payload).
2. Embed: Convert the question into the same vector space as chunks/entities.
3. Rank (vector): Select top‑K candidates by cosine similarity; optionally combine with keyword scoring for terms and names.
4. Expand (graph): Add nearby evidence with limited-hop traversal (see the AQL sketch after this list):
- From chunks to mentioned entities (has-entity).
- Between entities via typed relations (links-to).
- To summaries for macro context, sibling chunks, and provenance.
5. Stitch (temporal): Pull pre/post neighbors along the time chain for coherent narratives; apply time/camera filters as needed.
6. Pack context: Deduplicate and order evidence by score/time; include text snippets, entities/relations, timestamps, and stream/camera IDs.
7. Output formats: Text-centric context for summarization/Q&A, or a GNN-ready graph (nodes, relation types, edge indices, descriptions).
8. Tuning: Adjust top‑K, hop radius (typically 0–2), chunk size, and filters to balance recall, latency, and specificity.
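Putting steps 2 through 6 together, here is one way the hybrid retrieval could be expressed as a single AQL query over the hypothetical collections used above: rank chunks by cosine similarity, expand one to two hops over has_entity/links_to, and stitch in a temporal neighbor. COSINE_SIMILARITY is a built-in AQL function; with a vector index in place, the ranking could instead use APPROX_NEAR_COSINE (ArangoDB 3.12.4+). `embed()` is the same assumed helper as before.

```python
# Hybrid retrieval sketch: vector rank + hop-limited expansion + temporal stitch.
HYBRID_QUERY = """
FOR chunk IN chunks
  FILTER chunk.stream_id IN @streams                 // per-stream scoping
  LET score = COSINE_SIMILARITY(chunk.embedding, @query_embedding)
  SORT score DESC
  LIMIT @top_k                                       // step 3: rank (vector)
  LET entities = (                                   // step 4: expand (1..2 hops)
    FOR v IN 1..2 OUTBOUND chunk has_entity, links_to
      RETURN DISTINCT { name: v.name, type: v.type }
  )
  LET next_caption = FIRST(                          // step 5: temporal stitch
    FOR n IN 1..1 OUTBOUND chunk next_chunk RETURN n.caption
  )                                                  // (INBOUND gives the previous chunk)
  RETURN {
    caption: chunk.caption,
    start_s: chunk.start_s,
    end_s: chunk.end_s,
    camera_id: chunk.camera_id,
    score: score,
    entities: entities,
    next_caption: next_caption
  }
"""

cursor = db.aql.execute(HYBRID_QUERY, bind_vars={
    "query_embedding": embed("Which sections show rust or corrosion?"),
    "streams": ["stream-01"],
    "top_k": 8,
})
evidence = list(cursor)  # step 6: dedupe, order by score/time, pack into the prompt
```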
How this fits the broader VSS updates
The original VSS post introduced GA features such as multi‑live stream, burst mode ingestion, a customizable CV pipeline, and audio transcription. These modalities feed the GraphRAG pipeline so the agent can:
- Fuse visual information and audio transcriptions to improve precision,
- Use object/tracking metadata to clarify which entities are involved,
- Maintain per‑stream separation while supporting cross‑stream queries.
Together, these enable the temporal reasoning, multi‑hop reasoning, anomaly awareness, and scalability discussed in the CA‑RAG section of the original post, but now reinforced by a robust Knowledge Graph.
Figure 8: NVIDIA AI Blueprint for Video Search And Summarization
Get started
- Read the VSS Blueprint overview for APIs and deployment options (API Catalog, Launchables, Docker/Helm, Cloud): Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization.
- Learn more about ArangoDB (https://arangodb.com) and try VSS 2.4 with ArangoDB on your own videos. Other resources include:
- Explore code, examples, and deployment recipes:
- VSS Blueprint
- CA-RAG
- Documentation: https://nvidia.github.io/context-aware-rag/
- GitHub: https://github.com/NVIDIA/context-aware-rag