VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang

TL;DR

VideoRAG addresses the gap in applying retrieval-augmented generation to extremely long-context videos. It introduces a dual-channel architecture that combines graph-based textual knowledge grounding with multi-modal context encoding to index and retrieve information across videos of unlimited length. The authors present LongerVideos, a large long-video benchmark, and show through ablations and case studies that VideoRAG outperforms both RAG baselines and long-video understanding models. The code and the benchmark dataset are released as open source to foster further research.

Abstract

Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies through specialized multi-modal retrieval paradigms. Through comprehensive empirical evaluation on our proposed LongerVideos benchmark, which comprises over 160 videos totaling 134+ hours across lecture, documentary, and entertainment categories, VideoRAG demonstrates substantial performance gains over existing RAG alternatives and long video understanding methods. The source code of the VideoRAG implementation and the benchmark dataset are openly available at: https://github.com/HKUDS/VideoRAG.
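
To make the dual-channel idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes one channel can be approximated by an entity-to-clip map shared across videos (a stand-in for graph-based textual grounding) and the other by per-clip embedding vectors (a stand-in for multi-modal context encoding). The names `DualChannelIndex`, `add_clip`, and `retrieve`, the clip identifiers, and the additive scoring rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DualChannelIndex:
    """Hypothetical toy index; both channels are simplified assumptions."""

    def __init__(self):
        # Textual channel: entity -> clip ids, shared across all videos.
        self.entity_to_clips = defaultdict(set)
        # Visual channel: clip id -> embedding vector.
        self.clip_embeddings = {}

    def add_clip(self, clip_id, entities, embedding):
        for entity in entities:
            self.entity_to_clips[entity.lower()].add(clip_id)
        self.clip_embeddings[clip_id] = embedding

    def retrieve(self, query_entities, query_embedding, top_k=2):
        scores = defaultdict(float)
        # Textual channel: one point per query entity the clip mentions.
        for entity in query_entities:
            for clip_id in self.entity_to_clips.get(entity.lower(), ()):
                scores[clip_id] += 1.0
        # Visual channel: add embedding similarity for every indexed clip.
        for clip_id, embedding in self.clip_embeddings.items():
            scores[clip_id] += cosine(query_embedding, embedding)
        # Highest combined score first; ties broken by clip id.
        ranked = sorted(scores, key=lambda c: (-scores[c], c))
        return ranked[:top_k]


index = DualChannelIndex()
index.add_clip("v1#0", ["transformer", "attention"], [1.0, 0.0])
index.add_clip("v2#3", ["attention"], [0.0, 1.0])
index.add_clip("v2#4", ["diffusion"], [0.6, 0.8])
print(index.retrieve(["attention"], [1.0, 0.0]))
```

In this sketch a clip from a second video (`v2#3`) is retrieved because it shares an entity with the query even though its embedding is dissimilar, which is the kind of cross-video link the textual channel is meant to capture. A real system would extract entities and embeddings with models rather than take them as hand-written inputs.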

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The overall framework of our proposed RAG framework VideoRAG for videos.
  • Figure 2: Ablation on graph-based knowledge grounding and cross-modal retrieval components.
  • Figure 3: Instructions for LLM-based answer comparison and scoring.