Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

Tao Chen, Shaobo Ju, Qiong Wu, Chenxin Fang, Kun Zhang, Jun Peng, Hui Li, Yiyi Zhou, Rongrong Ji

TL;DR

This paper tackles the memory bottleneck of processing long videos with multimodal LLMs by introducing OneClip-RAG, a plug-and-play framework that uses query-guided video clips as external knowledge to augment reasoning. It unifies clip chunking and cross-modal retrieval in a single pipeline and pairs it with a coarse-to-fine instruction-tuning regime, aided by the SynLongVideo dataset designed to improve instruction following in clip-based retrieval. Across five MLLMs and multiple long-video benchmarks, OneClip-RAG yields substantial accuracy gains and notable efficiency improvements, including hour-long video understanding in minutes on a single GPU. The proposed approach advances practical long-video understanding by reducing computational overhead while retaining semantic coherence, making it feasible for real-world deployment.

Abstract

Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos with a limited number of frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. It is also equipped with a novel query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results show not only the clear performance gains that OneClip-RAG brings to MLLMs, e.g., boosting InternVL2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.

Paper Structure

This paper contains 13 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison between existing video RAG strategies and our OneClip-RAG. OneClip-RAG unifies video chunking and clip retrieval in a single paradigm based on cross-modal similarities, providing coherent frames for efficient long video understanding.
  • Figure 2: Overview of OneClip-RAG. (a) As a plug-and-play design, OneClip-RAG first performs clip chunking based on the given video and input instruction, and then selects the most relevant video clips to augment the video understanding of MLLMs. (b) OneClip-RAG uses the cross-modal similarities between frames and text instructions to capture changes in video content, and then determines the boundaries for video clip chunking. (c) OneClip-RAG can directly select the most relevant clips for MLLMs, requiring no additional models.
  • Figure 3: Statistical overview of the proposed SynLongVideo dataset. SynLongVideo aims to improve the instruction-following capability of clip retrieval models for long video understanding. In addition to available long video-question data [BarmannW22], it also synthesizes 430 long videos via visual and textual data mix-ups of short videos. The dataset statistics are given in the left table, and its main semantics and data distributions are shown in the middle and right graphs.
  • Figure 4: Efficiency and performance comparison between OneClip-RAG and other SOTA Video-MLLMs on MLVU. OneClip-RAG achieves superior performance with greater efficiency.
  • Figure 5: Visualized comparisons between our OneClip-RAG and other Video-RAG methods. The green letters are ground-truth answers, and the green dotted boxes indicate the frames of the long video that are related to the user's instruction.
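
The idea in the Figure 2 caption, where frame-to-instruction similarities both set the clip boundaries and rank the clips, can be illustrated with a small sketch. This is not the paper's algorithm: the boundary rule (an absolute jump between consecutive similarity scores), the `jump` threshold, the mean-similarity ranking, and the function names `chunk_by_similarity` and `select_clips` are all assumptions made for illustration. The similarity scores are assumed to be precomputed by some cross-modal encoder.

```python
from typing import List, Tuple

Clip = Tuple[int, int]  # half-open frame index range (start, end)


def chunk_by_similarity(sims: List[float], jump: float = 0.2) -> List[Clip]:
    """Split frames into clips wherever the frame-to-query similarity
    changes by more than `jump` between consecutive frames.

    The thresholded-jump rule is an illustrative assumption, not the
    boundary criterion used in the paper.
    """
    if not sims:
        return []
    clips: List[Clip] = []
    start = 0
    for i in range(1, len(sims)):
        if abs(sims[i] - sims[i - 1]) > jump:
            clips.append((start, i))  # close the current clip before frame i
            start = i
    clips.append((start, len(sims)))  # close the final clip
    return clips


def select_clips(sims: List[float], clips: List[Clip], top_k: int = 1) -> List[Clip]:
    """Rank clips by mean similarity to the query and return the best
    `top_k`, restored to temporal order. The same scores used for
    chunking are reused here, so no second retrieval pass is needed.
    """
    def mean_sim(clip: Clip) -> float:
        start, end = clip
        return sum(sims[start:end]) / (end - start)

    best = sorted(clips, key=mean_sim, reverse=True)[:top_k]
    return sorted(best)


# Hypothetical per-frame similarity scores for one query.
sims = [0.10, 0.12, 0.11, 0.60, 0.62, 0.58, 0.15, 0.14]
clips = chunk_by_similarity(sims)
selected = select_clips(sims, clips, top_k=1)
```

With these toy scores, the frames split into three clips and the middle one, whose frames score highest against the query, is selected. Reusing one set of scores for both steps is what lets chunking and retrieval share a single pass in this sketch.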