RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
Xichen Tan, Yunfan Ye, Yuanjing Luo, Qian Wan, Fang Liu, Zhiping Cai
TL;DR
This work addresses the information loss that uniform frame sampling causes in long-video understanding benchmarks. It introduces RAG-Adapter, a plug-and-play framework that retrieves the frames most relevant to a given question and feeds them to multimodal LLMs without changing their architectures. RAG-Adapter couples a Dual Reranker with MMR-based selection, and the authors construct the MMAT dataset and propose Grouped-supervised Contrastive Learning (GCL) to align image and text embeddings for effective retrieval. To quantify benchmark quality and task complexity, the authors define Average Similarity Score (ASS) and Necessary Information Frame (NIF), showing that only a small subset of frames typically carries the needed information. Results on Video-MME, MLVU, Perception Test, and EgoSchema show consistent accuracy gains over uniform sampling, supporting RAG-Adapter as a practical, broadly applicable enhancement for long-video benchmarks. The approach offers a scalable path to more accurate evaluation of long-video understanding for both open-source and commercial MLLMs.
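The MMR-based selection mentioned above can be illustrated with a minimal sketch of generic Maximal Marginal Relevance over embedding vectors. This is the textbook greedy formulation, not necessarily the exact variant used in RAG-Adapter: the function names, the trade-off parameter `lam`, the use of cosine similarity, and the choice to return indices in temporal order are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query_emb, frame_embs, k, lam=0.5):
    """Greedy MMR: pick up to k frame indices balancing relevance and diversity.

    lam=1.0 reduces to plain top-k by relevance; smaller values penalize
    frames that resemble ones already selected. (Hypothetical helper.)
    """
    relevance = [cosine(query_emb, f) for f in frame_embs]
    selected = []
    candidates = list(range(len(frame_embs)))

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any already-selected frame.
            redundancy = max(
                (cosine(frame_embs[i], frame_embs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    # Assumption: frames are handed to the model in temporal order.
    return sorted(selected)
```

With `lam` close to 1 the selection favors the frames most similar to the question, even if they are near-duplicates of each other. Lowering `lam` trades some relevance for coverage of visually distinct moments.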
Abstract
Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To assess their video comprehension capabilities, long video understanding benchmarks such as Video-MME and MLVU have been proposed. However, these benchmarks test models on uniformly sampled frames, which causes significant information loss and prevents the evaluations from accurately reflecting the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling the frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method that further enhances the sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks and find that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., the accuracy of GPT-4o increases by 9.3 percent on Video-MME), providing a more accurate testing method for long video benchmarks.
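To make the information-loss argument concrete, the sketch below shows one common way to compute uniformly spaced frame indices. It is an illustrative baseline only: the helper name and the choice to sample the midpoint of each segment are assumptions, and the benchmarks may implement uniform sampling differently.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Return evenly spaced frame indices, one from the middle of each segment.

    Hypothetical helper for illustration; not taken from any benchmark's code.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

As a hypothetical example, a one-hour video at 30 frames per second has 108,000 frames. Sampling 32 of them uniformly leaves a gap of 3,375 frames (about 112 seconds) between consecutive samples, so a brief event that answers the question can easily fall between two sampled frames.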
