
iRAG: Advancing RAG for Videos with an Incremental Approach

Md Adnan Arefeen, Biplob Debnath, Md Yusuf Sarwar Uddin, Srimat Chakradhar

TL;DR

The paper tackles the inefficiency of conventional RAG pipelines for long videos, which suffer from two problems: lengthy upfront video-to-text processing and information loss during that conversion. It introduces iRAG, an incremental RAG framework that first builds a fast, lightweight index and then applies heavyweight models on demand, extracting detailed information only from the portions of video that the user query calls for. The core contributions are the Planner and Extractor modules, which orchestrate query-aware retrieval and on-demand augmentation of the textual index. Together they enable interactive querying with 23x to 25x faster ingestion while keeping latency and answer quality comparable to full upfront RAG. In practice, iRAG enables timely, accurate understanding of large video corpora, and the approach can be extended to other non-text data modalities, broadening the applicability of retrieval-augmented reasoning for complex, time-series content.

Abstract

Retrieval-augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications such as chatbots. Using RAG to understand videos is appealing, but there are two critical limitations. One-time, upfront conversion of all content in a large corpus of videos into text descriptions entails high processing times. Also, not all information in the rich video data is typically captured in the text descriptions. Since user queries are not known a priori, developing a system for video-to-text conversion and interactive querying of video data is challenging. To address these limitations, we propose an incremental RAG system called iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of a large corpus of videos. Unlike traditional RAG, iRAG quickly indexes large repositories of videos, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the videos to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long video-to-text conversion times and overcomes the information loss caused by converting video to text, by performing on-demand, query-specific extraction of details from the video data. This ensures high-quality responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of a large corpus of videos. Experimental results on real-world datasets demonstrate 23x to 25x faster video-to-text ingestion, while ensuring that the latency and quality of responses to interactive user queries are comparable to those of a traditional RAG in which all video data is converted to text upfront, before any user querying.
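
The incremental workflow described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the names `Clip`, `IncrementalIndex`, `plan`, and `extract` are hypothetical, word overlap stands in for whatever retrieval method the real system uses, and the precomputed `coarse_text` and `detailed_text` strings stand in for the outputs of lightweight and heavyweight vision models. The point it shows is only the control flow: a cheap pass indexes every clip, and the expensive pass runs later, once, on the clips a query selects.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    """One video segment; both text fields are stand-ins for model output."""
    clip_id: int
    coarse_text: str        # assumed output of a fast, lightweight model
    detailed_text: str      # assumed output of a slow, heavyweight model
    enriched: bool = False  # True once the heavy pass has run for this clip


@dataclass
class IncrementalIndex:
    clips: list
    index: dict = field(default_factory=dict)  # clip_id -> set of indexed words
    heavy_calls: int = 0                       # how many heavy passes were run

    def ingest(self):
        """Cheap upfront pass: index only the coarse descriptions."""
        for clip in self.clips:
            self.index[clip.clip_id] = set(clip.coarse_text.lower().split())

    def plan(self, query, k):
        """Hypothetical planner: rank clips by word overlap, keep the top k."""
        words = set(query.lower().split())
        scored = [(len(words & self.index[c.clip_id]), c.clip_id)
                  for c in self.clips]
        ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
        return [cid for score, cid in ranked[:k] if score > 0]

    def extract(self, clip_ids):
        """Hypothetical extractor: run the heavy pass once per selected clip."""
        by_id = {c.clip_id: c for c in self.clips}
        for cid in clip_ids:
            clip = by_id[cid]
            if not clip.enriched:
                self.index[cid] |= set(clip.detailed_text.lower().split())
                clip.enriched = True
                self.heavy_calls += 1

    def retrieve(self, query, k=2):
        """Plan, enrich the selected clips, and return their detailed text."""
        selected = self.plan(query, k)
        self.extract(selected)
        by_id = {c.clip_id: c for c in self.clips}
        return [by_id[cid].detailed_text for cid in selected]
```

In this sketch, a query that matches one clip triggers a single heavy pass, and repeating the query triggers none, because the enriched text is already in the index. Clips that no query ever selects are never processed by the expensive model, which is the source of the ingestion savings in this toy setting.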

Paper Structure

This paper contains 25 sections, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Video-to-text conversion.
  • Figure 2: Conventional RAG workflow.
  • Figure 3: Time to generate a long document from real-world videos.
  • Figure 4: iRAG overview. The light blue rectangle denotes the additional workflow compared to a conventional RAG.
  • Figure 5: Query processing time distribution for different values of $k$.
  • ...and 2 more figures