
MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan

TL;DR

A novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding.

Abstract

Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds on the key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries that describe diverse visual perspectives relevant to the question. These queries then interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix and selects a compact set of informative frames, which are fed into the MLLM for answer generation. Both the MLLM and the sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.
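
The query-to-frame selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embeddings stand in for the outputs of the frozen CLIP model, and the hypothetical `frame_weights` function (max-pooling over queries followed by a softmax) is only an assumed placeholder for the learned sampler, whose architecture the abstract does not specify. All function names and the `temperature` parameter are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(query_embs, frame_embs):
    """Rows index queries, columns index frames."""
    return [[cosine(q, f) for f in frame_embs] for q in query_embs]


def frame_weights(sim, temperature=0.1):
    """Assumed stand-in for the learned sampler: keep each frame's best
    query match, then normalize with a softmax into sampling weights."""
    num_frames = len(sim[0])
    pooled = [max(row[j] for row in sim) for j in range(num_frames)]
    peak = max(pooled)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in pooled]
    total = sum(exps)
    return [e / total for e in exps]


def select_key_frames(weights, k):
    """Indices of the k highest-weight frames, returned in temporal order."""
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings (hypothetical): two queries and five frames in a 3-d space.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [
    [0.9, 0.1, 0.0],  # matches query 0
    [0.0, 0.0, 1.0],  # matches neither query
    [0.1, 0.9, 0.0],  # matches query 1
    [0.0, 0.1, 0.9],
    [0.7, 0.7, 0.0],  # partially matches both queries
]

sim = similarity_matrix(queries, frames)
weights = frame_weights(sim)
key_frames = select_key_frames(weights, k=2)
print(key_frames)
```

In this toy setup the two selected frames are the ones that each match a different query, which reflects the intuition behind generating several diverse queries per question. In MSJoE the weights come from a trained sampler optimized jointly with the MLLM rather than from a fixed pooling rule.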

Paper Structure (50 sections, 8 equations, 8 figures, 4 tables, 1 algorithm)

This paper contains 50 sections, 8 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: A direct comparison among static key-frame sampling algorithms, trainable key-frame sampler, and our proposed MLLM-Sampler Joint Evolution framework (MSJoE).
  • Figure 2: Our dataset construction pipeline. QA pairs with low difficulty or poor quality are removed during multi-stage filtering.
  • Figure 3: The proposed MSJoE framework. Given a video and question, MSJoE generates reasoning-based queries from a sparse preview, matches them against dense frames via CLIP to create a similarity matrix, and uses a lightweight U-Sampler to select informative frames. The MLLM then processes these key frames at high resolution for answer generation. The entire framework is jointly optimized through end-to-end reinforcement learning.
  • Figure 4: Ablation studies on varying numbers of input frames (x-axis). Four methods are evaluated, including MSJoE in light violet, Top-$k$ in red, and Uniform Sampling in gray.
  • Figure 5: Three frame sets from a publicly available video generated by different sampling strategies. Question: What motivated her to change dietary habits? (A) Family and friends (B) Diabetes (C) Tooth decay (D) Anemia.
  • ...and 3 more figures
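
The two static baselines named in the Figure 4 caption, Uniform Sampling and Top-$k$, can be sketched as follows. This is an illustrative sketch rather than the paper's evaluation code: the per-frame relevance scores are made-up values, and the function names `uniform_indices` and `top_k_indices` are hypothetical.

```python
def uniform_indices(num_frames, k):
    """Pick the centre frame of each of k equal-length segments."""
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def top_k_indices(scores, k):
    """Pick the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical relevance scores for a 20-frame clip: only frames 13-15
# are relevant to the question.
scores = [0.1] * 20
scores[13:16] = [0.9, 0.8, 0.7]

uniform = uniform_indices(len(scores), 4)
top_k = top_k_indices(scores, 3)
print(uniform, top_k)
```

With these made-up scores, uniform sampling covers the whole timeline but misses the short relevant segment, while Top-$k$ concentrates entirely on adjacent frames inside it. A learned sampler is one way to trade off between these two behaviours.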