RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding

Xichen Tan, Yunfan Ye, Yuanjing Luo, Qian Wan, Fang Liu, Zhiping Cai

TL;DR

This work tackles the information loss in long-video understanding benchmarks caused by uniform frame sampling by introducing RAG-Adapter, a plug-and-play framework that retrieves frames most relevant to a given question and feeds them to multimodal LLMs without changing their architectures. It couples a Dual Reranker with MMR-based selection and introduces MMAT and Grouped-supervised Contrastive Learning (GCL) to align image and text embeddings for effective retrieval. To quantify benchmark quality and task complexity, the authors define Average Similarity Score (ASS) and Necessary Information Frame (NIF), showing that only a small subset of frames typically carries the needed information. Empirical results across Video-MME, MLVU, Perception Test, and EgoSchema demonstrate consistent accuracy gains over uniform sampling, validating RAG-Adapter as a practical enhancement for long-video benchmarks with broad generalization. The approach offers a scalable path to more accurate testing of long-video understanding for both open-source and commercial MLLMs.
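The summary above mentions MMR-based selection but does not spell it out. As a rough illustration only, the following is a minimal sketch of generic Maximal Marginal Relevance over frame embeddings, assuming cosine similarity and a trade-off weight `lam`. The function names, the toy vectors, and the value of `lam` are hypothetical, and the sketch does not reproduce the paper's Dual Reranker.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(query: Sequence[float],
               frames: List[Sequence[float]],
               k: int,
               lam: float = 0.7) -> List[int]:
    """Greedy MMR: pick up to k frame indices, trading off relevance to the
    query against similarity to frames that were already picked.

    lam = 1.0 ranks purely by relevance; smaller values favour diversity.
    (lam and its default are assumptions, not values from the paper.)
    """
    relevance = [cosine(query, f) for f in frames]
    remaining = list(range(len(frames)))
    selected: List[int] = []
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            # Redundancy: highest similarity to any already-selected frame.
            redundancy = max(
                (cosine(frames[i], frames[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy embeddings (hypothetical): frames 0 and 1 are near-duplicates.
query = [1.0, 0.0]
frames = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
relevance_only = mmr_select(query, frames, k=2, lam=1.0)
diversity_heavy = mmr_select(query, frames, k=2, lam=0.3)
```

With `lam=1.0` the selection reduces to plain relevance ranking, so the two near-duplicate frames are both chosen. Lowering `lam` penalizes the near-duplicate and lets a less similar frame through, which is the kind of redundancy control MMR is typically used for.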

Abstract

Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To effectively assess their video comprehension capabilities, long video understanding benchmarks, such as Video-MME and MLVU, have been proposed. However, these benchmarks directly use uniform frame sampling for testing, which causes significant information loss and prevents the evaluations from accurately reflecting the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling the frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method to further enhance the sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks, finding that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., the accuracy of GPT-4o increases by 9.3 percent on Video-MME), providing a more accurate testing method for long video benchmarks.
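To make the contrast in the abstract concrete, here is a small sketch comparing uniform frame sampling with question-relevant sampling. It assumes each frame already has a relevance score for the question. The scores below are made-up toy values, and `uniform_sample` and `topk_sample` are hypothetical helper names, not functions from the paper.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices at the midpoints of k equal-length segments."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def topk_sample(scores: Sequence[float], k: int) -> List[int]:
    """Pick the k highest-scoring frame indices, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy video of 100 frames in which only two consecutive frames (40 and 41)
# are relevant to the question. All scores here are invented for illustration.
scores = [0.1] * 100
scores[40] = 0.9
scores[41] = 0.8

uniform_idx = uniform_sample(len(scores), k=4)
relevant_idx = topk_sample(scores, k=2)
```

In this toy setup the uniform indices skip the two informative frames, while the score-based selection recovers them with a smaller frame budget. How the scores are produced (the encoders, the Dual Reranker) is outside the scope of this sketch.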

Paper Structure

This paper contains 36 sections, 3 equations, 36 figures, 12 tables.

Figures (36)

  • Figure 1: (a) and (b) show a comparison between scenarios with and without the RAG-Adapter framework, respectively.
  • Figure 2: The RAG-Adapter pipeline framework. Given a video and a question, the video frames and corresponding captions are encoded separately using image and text encoders and stored in databases. The question is encoded and retrieved using the same encoders. The Dual Reranker module selects the Top$K$ frames relevant to the question. Details are provided in \ref{sec:pipeline}. To improve retrieval performance, both encoders are fine-tuned using Grouped-supervised Contrastive Learning (GCL), as described in \ref{sec:ft}.
  • Figure 3: Illustration of Grouped-supervised Contrastive Learning (GCL) constructing positive and negative pairs.
  • Figure 4: Comparison of RAG-Adapter and uniform sampling results: RAG-Adapter accurately identifies two consecutive key frames relevant to the question, whereas uniform sampling tends to miss them.
  • Figure 5: The relationship between the embedding spaces of video frames sampled using different methods and that of the corresponding questions. The frame embeddings are primarily grouped into five clusters, each representing a set of consecutive shots, with each cluster labeled by a representative frame.
  • ...and 31 more figures