AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

Zhucun Xue, Jiangning Zhang, Xurong Xie, Yuxuan Cai, Yong Liu, Xiangtai Li, Dacheng Tao

TL;DR

AdaVideoRAG tackles the challenge of long-video understanding by introducing an adaptive retrieval framework that matches query difficulty with retrieval depth. It combines an intent classifier, Omni-Knowledge Indexing (text, visual, and knowledge-graph bases), and a hierarchical retrieval strategy to balance precision and efficiency. The HiVU benchmark is released to evaluate multi-level reasoning on long videos, and extensive experiments show improved accuracy and speed over fixed-path RAG baselines, especially for complex, multi-hop tasks. This approach enables scalable, cognitively deep video analysis and can be readily integrated with existing multimodal LLMs via lightweight APIs, potentially redefining retrieval-augmented video analysis in practical systems.

Abstract

Multimodal Large Language Models (MLLMs) perform well in video understanding but degrade on long videos due to fixed-length context and weak long-term dependency modeling. Retrieval-Augmented Generation (RAG) can expand knowledge dynamically, yet existing video RAG schemes adopt fixed retrieval paradigms that ignore query difficulty. This uniform design causes redundant computation and latency for simple queries, while coarse retrieval for complex, multi-hop reasoning can miss key information. Such single-step retrieval severely limits the trade-off between efficiency and cognitive depth. We propose AdaVideoRAG, an adaptive RAG framework for long-video understanding. A lightweight intent classifier dynamically selects a suitable retrieval scheme according to query complexity, ranging from the simplest to the most sophisticated. We design an Omni-Knowledge Indexing module that extracts and organizes multimodal information into three databases: (1) a text base built from clip captions, ASR, and OCR; (2) a visual base; and (3) a knowledge graph for deep semantic understanding. This supports hierarchical knowledge access, from naive retrieval to graph-based retrieval, balancing resource cost and reasoning ability. To evaluate deep understanding, we further construct the HiVU benchmark. Experiments show that AdaVideoRAG significantly improves both efficiency and accuracy on long-video QA tasks and can be seamlessly plugged into existing MLLMs through lightweight APIs, establishing a new paradigm for adaptive retrieval-augmented video analysis.
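
The routing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything named here is an assumption made for illustration: the `Level` enum (including a retrieval-free level for the simplest queries), the `OmniIndex` container, the keyword-based `classify_intent` stand-in for the intent classifier, and the token-overlap scoring. The visual base is declared but left unused, since it would need an embedding model.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set, Tuple


class Level(Enum):
    NONE = 1    # assumed: simplest queries skip retrieval entirely
    NAIVE = 2   # naive retrieval over the text base
    GRAPH = 3   # graph-based retrieval for multi-hop questions


@dataclass
class OmniIndex:
    """Hypothetical container for the three knowledge bases."""
    text_base: List[str] = field(default_factory=list)    # captions, ASR, OCR
    visual_base: List[str] = field(default_factory=list)  # placeholder, unused here
    graph: Dict[str, Set[str]] = field(default_factory=dict)  # entity -> neighbours


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_intent(query: str) -> Level:
    """Toy keyword heuristic standing in for the intent classifier."""
    tokens = tokenize(query)
    if tokens & {"why", "compare", "relationship", "cause"}:
        return Level.GRAPH
    if tokens & {"when", "where", "who", "which"}:
        return Level.NAIVE
    return Level.NONE


def naive_retrieve(index: OmniIndex, query: str, k: int = 2) -> List[str]:
    """Rank text entries by token overlap with the query and keep the top k."""
    terms = tokenize(query)
    scored = [(len(terms & tokenize(doc)), doc) for doc in index.text_base]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [doc for _, doc in scored[:k]]


def graph_retrieve(index: OmniIndex, query: str, k: int = 2) -> List[str]:
    """Extend naive hits with graph edges of entities named in the query."""
    hits = naive_retrieve(index, query, k)
    terms = tokenize(query)
    for entity, neighbours in index.graph.items():
        if entity in terms:
            hits.extend(f"{entity} -> {n}" for n in sorted(neighbours))
    return hits


def retrieve(index: OmniIndex, query: str) -> Tuple[Level, List[str]]:
    """Route the query to the retrieval path chosen by the classifier."""
    level = classify_intent(query)
    if level is Level.NONE:
        return level, []
    if level is Level.NAIVE:
        return level, naive_retrieve(index, query)
    return level, graph_retrieve(index, query)
```

In this sketch, a query such as "Describe the scene" would bypass retrieval, "Where is the chef?" would use only the text base, and "Why does the chef burn the sauce?" would additionally pull graph edges. The point is only that cheaper paths are taken when the classifier judges the query simple. The actual classifier and retrievers in the paper may differ substantially.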

Paper Structure

This paper contains 25 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of different video understanding frameworks: i) MLLMs are efficient but can only handle simple problems. ii) VideoRAG [videorag_xiamen] integrates external knowledge via naive retrieval but still struggles with hard reasoning questions. iii) The more recent VideoRAG [videorag_hku] tackles complex problems using graph retrieval but suffers from low efficiency. Our novel AdaVideoRAG framework adaptively routes queries to different retrieval paths via query intent classification, achieving a better trade-off between effectiveness and efficiency.
  • Figure 2: Overview of our AdaVideoRAG framework, which consists of: 1) Query Intent Classification, 2) Omni-Knowledge Indexing, 3) Adaptive Retrieval Paradigm, and 4) Integration and Generation.
  • Figure 3: Statistical distributions of our HiVU from different perspectives.
  • Figure A1: Qualitative results of VideoLLaMA when applying Video-RAG.
  • Figure A2: Prompts for Level Classification.