AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

Zhucun Xue; Jiangning Zhang; Xurong Xie; Yuxuan Cai; Yong Liu; Xiangtai Li; Dacheng Tao

TL;DR

AdaVideoRAG addresses long-video understanding with an adaptive retrieval framework that matches retrieval depth to query difficulty. It combines a lightweight intent classifier, Omni-Knowledge Indexing (text, visual, and knowledge-graph bases), and a hierarchical retrieval strategy to balance precision and efficiency. The authors release the HiVU benchmark to evaluate multi-level reasoning on long videos. Experiments show improved accuracy and speed over fixed-path RAG baselines, especially on complex, multi-hop tasks. The framework can be integrated with existing multimodal LLMs through lightweight APIs.

Abstract

Multimodal Large Language Models (MLLMs) perform well in video understanding but degrade on long videos due to fixed-length context and weak long-term dependency modeling. Retrieval-Augmented Generation (RAG) can expand knowledge dynamically, yet existing video RAG schemes adopt fixed retrieval paradigms that ignore query difficulty. This uniform design causes redundant computation and latency for simple queries, while coarse retrieval for complex, multi-hop reasoning can miss key information. Such single-step retrieval severely limits the trade-off between efficiency and cognitive depth. We propose AdaVideoRAG, an adaptive RAG framework for long-video understanding. A lightweight intent classifier dynamically selects suitable retrieval schemes according to query complexity from the simplest to the most sophisticated. We design an Omni-Knowledge Indexing module that extracts and organizes multi-modal information into three databases: (1) a text base built from clip captions, ASR, and OCR; (2) a visual base; and (3) a knowledge graph for deep semantic understanding. This supports hierarchical knowledge access, from naive retrieval to graph-based retrieval, balancing resource cost and reasoning ability. To evaluate deep understanding, we further construct the HiVU benchmark. Experiments show that AdaVideoRAG significantly improves both efficiency and accuracy on long-video QA tasks and can be seamlessly plugged into existing MLLMs through lightweight APIs, establishing a new paradigm for adaptive retrieval-augmented video analysis.
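The routing idea in the abstract (classify the query, then choose between naive and graph-based retrieval over the three knowledge bases) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword-based `classify_intent`, the `OmniIndex` container, the two-level `RetrievalLevel` enum, and the term-overlap matching are all hypothetical stand-ins chosen for illustration, and the real classifier is presumably a learned model with possibly more retrieval levels.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class RetrievalLevel(Enum):
    NAIVE = "naive"  # look up the text and visual bases only
    GRAPH = "graph"  # additionally consult the knowledge graph


# Hypothetical cue words; a toy substitute for a learned intent classifier.
MULTI_HOP_CUES = ("why", "how", "relationship", "compare")


def classify_intent(query: str) -> RetrievalLevel:
    """Route queries containing a multi-hop cue word to graph retrieval."""
    words = query.lower().split()
    if any(cue in words for cue in MULTI_HOP_CUES):
        return RetrievalLevel.GRAPH
    return RetrievalLevel.NAIVE


@dataclass
class OmniIndex:
    """Illustrative container for the three knowledge bases."""
    text_base: dict[str, str] = field(default_factory=dict)    # clip id -> caption/ASR/OCR text
    visual_base: dict[str, str] = field(default_factory=dict)  # clip id -> visual feature placeholder
    graph: dict[str, list[str]] = field(default_factory=dict)  # entity -> related entities


def retrieve(query: str, index: OmniIndex) -> dict:
    """Gather context at the depth chosen by the intent classifier."""
    level = classify_intent(query)
    terms = set(query.lower().rstrip("?").split())

    # Naive retrieval: clips whose text shares at least one term with the query.
    clips = sorted(
        clip_id
        for clip_id, text in index.text_base.items()
        if terms & set(text.lower().split())
    )
    context = {
        "level": level.value,
        "clips": clips,
        "visual": [index.visual_base[c] for c in clips if c in index.visual_base],
    }

    # Graph retrieval: also collect entities linked to any query term.
    if level is RetrievalLevel.GRAPH:
        context["related"] = sorted(
            {neighbor for term in terms for neighbor in index.graph.get(term, [])}
        )
    return context
```

In this sketch a simple lookup question never touches the graph, while a "why" question pays the extra cost of graph traversal, which is the efficiency versus depth trade-off the abstract describes.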
