VideoRAG: Retrieval-Augmented Generation over Video Corpus

Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang

TL;DR

VideoRAG introduces retrieval-augmented generation over a video corpus, enabling dynamic video retrieval and grounding of responses in both visual and textual video content. It tackles the challenge of long, redundant videos with adaptive frame selection and a clustering-based reduction, while also generating auxiliary transcripts when subtitles are unavailable. Empirical results on WikiHowQA/HowTo100M show VideoRAG consistently outperforms text- and image-based baselines, with larger LVLMs and multimodal representations delivering the strongest gains. The work demonstrates the practical potential of using videos as a rich, multimodal knowledge source for grounding generative QA systems.
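The clustering-based reduction mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each frame already has an embedding vector, uses plain k-means with a deterministic, evenly spaced initialization, and keeps the frame closest to each centroid. The function name `select_representative_frames` and its parameters are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_clusters(embeddings, centroids):
    """Group frame indices by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for idx, emb in enumerate(embeddings):
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(emb, centroids[c]))
        clusters[nearest].append(idx)
    return clusters


def select_representative_frames(frame_embeddings, k, iterations=20):
    """Return sorted indices of at most k frames, one per k-means cluster.

    Assumption: frame_embeddings is a list of equal-length float vectors,
    one per frame, produced by some (unspecified) visual encoder.
    """
    n = len(frame_embeddings)
    if k >= n:
        return list(range(n))
    # Deterministic init: evenly spaced frames serve as initial centroids.
    centroids = [list(frame_embeddings[(i * n) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = assign_clusters(frame_embeddings, centroids)
        updated = [
            mean_vector([frame_embeddings[i] for i in members]) if members
            else centroids[c]  # keep an empty cluster's centroid unchanged
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Final assignment against the final centroids.
    clusters = assign_clusters(frame_embeddings, centroids)
    selected = set()
    for c, members in enumerate(clusters):
        if members:
            selected.add(min(
                members,
                key=lambda i: squared_distance(frame_embeddings[i], centroids[c]),
            ))
    return sorted(selected)


# Two visually distinct "scenes" of three near-duplicate frames each.
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
print(select_representative_frames(frames, k=2))  # → [0, 3]
```

Picking the frame nearest each centroid is one simple way to drop near-duplicate frames while keeping one representative per visual cluster; the paper's actual selection procedure may differ.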

Abstract

Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To address these limitations, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. In addition, motivated by the observations that the context size of LVLMs may not be sufficient to process all frames in extremely long videos and that not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showing that it outperforms relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
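To make the retrieval step concrete, here is a minimal sketch of ranking videos by similarity between a query embedding and a fused video embedding. It is an illustration under stated assumptions, not the released code: the fusion is assumed to be a linear interpolation with ratio `alpha` between textual and visual features, similarity is assumed to be cosine, and the names `fuse`, `retrieve`, and the toy `corpus` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse(visual, textual, alpha):
    """Assumed fusion: alpha * textual + (1 - alpha) * visual."""
    return [alpha * t + (1.0 - alpha) * v for v, t in zip(visual, textual)]


def retrieve(query_embedding, corpus, alpha=0.5, top_k=1):
    """Return the ids of the top_k videos ranked by cosine similarity.

    corpus maps video id -> (visual_embedding, textual_embedding).
    Ties are broken by video id so the ranking is deterministic.
    """
    scored = [
        (cosine(query_embedding, fuse(visual, textual, alpha)), video_id)
        for video_id, (visual, textual) in corpus.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [video_id for _, video_id in scored[:top_k]]


# Toy corpus with hand-written 2-d embeddings (illustrative values only).
corpus = {
    "a": ([1.0, 0.0], [1.0, 0.0]),
    "b": ([0.0, 1.0], [0.0, 1.0]),
    "c": ([1.0, 1.0], [0.0, 1.0]),
}
print(retrieve([1.0, 0.0], corpus, alpha=0.5, top_k=2))  # → ['a', 'c']
```

Setting `alpha` to 0 or 1 reduces the sketch to visual-only or textual-only retrieval, which mirrors the kind of comparison the paper reports between single-modality features and their ensemble.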

Paper Structure (44 sections, 7 figures, 14 tables)

Figures (7)

  • Figure 1: Illustration of existing and the proposed RAG scenarios. (A) Textual RAG retrieves documents (relevant to queries) from a text corpus and incorporates them when generating answers. (B) Conventional image-text multimodal RAG extends retrieval to include static images. (C) VideoRAG (ours) further extends the external knowledge source to videos.
  • Figure 2: Illustration of the overall pipeline of our VideoRAG, which selects informative frames for retrieval and generation.
  • Figure 2: Retrieval results, where we use visual features alone, textual features alone, or an ensemble of their features.
  • Figure 3: Visualization of latent space of features across modalities with Principal Component Analysis (PCA).
  • Figure 4: Impact of varying the interpolation ratio between textual and visual features on the video retrieval performance.
  • ...and 2 more figures