From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

Yucheng Suo, Fan Ma, Linchao Zhu, Tianyi Wang, Fengyun Rao, Yi Yang

TL;DR

This work tackles long video understanding under frame-limited inference by generating multiple predictions through bin-wise visual-context sampling and selecting the best one with a self-reward score. The self-reward combines a frequency-based vote, inter-intra sample marginal confidence, and an adaptive contextual voting mechanism that tailors reasoning for global versus local questions. Across seven long-video benchmarks and three base MLLMs, the approach yields consistent gains (e.g., up to +$4.28\%$ on Video-MME and +$5.89\%$ on MLVU) while maintaining training-free inference. The method broadens visual perception for long videos and offers a practical, resource-efficient path to improved understanding in multimodal systems.

Abstract

Multi-modal large language models (MLLMs) show remarkable ability in video understanding. Nevertheless, understanding long videos remains challenging because the models can only process a finite number of frames in a single inference, potentially omitting crucial visual information. To address this challenge, we propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Specifically, we devise a bin-wise sampling strategy that enables MLLMs to generate diverse answers based on various combinations of keyframes, thereby enriching the visual context. To determine the final prediction from the sampled answers, we employ a self-reward that linearly combines three scores: (1) a frequency score indicating the prevalence of each option, (2) a marginal confidence score reflecting the inter-intra sample certainty of MLLM predictions, and (3) a reasoning score for different question types, including clue-guided answering for global questions and temporal self-refocusing for local questions. The frequency score ensures robustness through majority correctness, the marginal confidence score reflects prediction certainty, and the typed-reasoning score addresses cases with sparse key visual information using tailored strategies. Experiments show that this approach covers the correct answer for a high percentage of long video questions, and results on seven datasets show that our method improves the performance of three MLLMs.
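The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: that bin-wise sampling draws one frame uniformly from each equal-width temporal bin, that the marginal confidence is the gap between the top two option probabilities averaged over samples, that the reasoning score is a 0/1 vote for a single complementary answer, and that the weights `alpha` and `beta` enter as plain multipliers. The function names are hypothetical.

```python
import random
from collections import defaultdict


def binwise_sample(num_frames, num_bins, rng):
    """Draw one frame index uniformly from each of num_bins equal-width bins.

    Assumption: this is one plausible reading of "bin-wise sampling".
    """
    edges = [i * num_frames // num_bins for i in range(num_bins + 1)]
    return [rng.randrange(lo, hi) for lo, hi in zip(edges, edges[1:]) if hi > lo]


def self_reward(samples, complementary_answer=None, alpha=1.0, beta=1.0):
    """Score every option by frequency + alpha * margin + beta * vote.

    samples: one dict per sampled visual context, mapping option -> probability.
    complementary_answer: option favoured by the typed-reasoning step, or None.
    All three score definitions here are illustrative assumptions.
    """
    n = len(samples)
    freq = defaultdict(float)    # share of samples whose top option is this one
    margin = defaultdict(float)  # averaged top-1 minus top-2 probability gap
    for probs in samples:
        ranked = sorted(probs, key=probs.get, reverse=True)
        top, runner_up = ranked[0], ranked[1]
        freq[top] += 1.0 / n
        margin[top] += (probs[top] - probs[runner_up]) / n

    options = set().union(*samples)
    scores = {}
    for option in options:
        vote = 1.0 if option == complementary_answer else 0.0
        scores[option] = freq[option] + alpha * margin[option] + beta * vote
    return scores


def select_answer(samples, complementary_answer=None, alpha=1.0, beta=1.0):
    """Return the option with the largest combined score."""
    scores = self_reward(samples, complementary_answer, alpha, beta)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two different visual contexts for the same 100-frame video, 8 frames each.
    print(binwise_sample(100, 8, rng))
    print(binwise_sample(100, 8, rng))

    toy_samples = [
        {"A": 0.60, "B": 0.30, "C": 0.10},
        {"A": 0.50, "B": 0.40, "C": 0.10},
        {"B": 0.90, "A": 0.05, "C": 0.05},
    ]
    print(select_answer(toy_samples))
    print(select_answer(toy_samples, complementary_answer="B"))
```

In this toy setup the majority option wins when no complementary answer is supplied, while a sufficiently large `beta` lets the typed-reasoning vote override the majority, which is the kind of case the abstract attributes to sparse key visual information.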

Paper Structure

This paper contains 15 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Pipeline and motivation. Our pipeline generates multiple predictions for each long video question and selects the best prediction using a self-reward score. This approach is motivated by a coverage test, which shows that Pass@10 exceeds the accuracy of the majority answer by $15\%$ with the Llava-Video model.
  • Figure 2: Scaling sampling experiments. (a) Coverage test using different models, (b) comparison between global and bin-wise visual context sampling, (c) coverage test using fewer frames.
  • Figure 3: Method overview. Our method first adopts bin-wise visual context sampling to generate multiple predictions and then selects the final prediction with the largest self-reward score. The score consists of three components: the frequency score $\mathcal{S}^f$, the marginal confidence score $\mathcal{S}^{mc}$, and a voting score $\mathcal{S}^v$ calculated from the complementary answer.
  • Figure 4: Percentage of questions with divergent predictions.
  • Figure 5: Influence of $\alpha$ and $\beta$ using Llava-Video on Video-MME.
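
The coverage test mentioned in the captions for Figures 1 and 2 compares two quantities over the same set of sampled predictions. A minimal Python sketch is given here, assuming that Pass@k counts a question as covered when at least one of its first k sampled answers is correct and that the majority answer is the most frequent of those k answers. The function names and the toy data are hypothetical and do not reproduce any result from the paper.

```python
from collections import Counter


def pass_at_k(predictions, answers, k):
    """Fraction of questions whose first k sampled answers contain the truth."""
    hits = sum(ans in preds[:k] for preds, ans in zip(predictions, answers))
    return hits / len(answers)


def majority_accuracy(predictions, answers, k):
    """Fraction of questions whose most frequent answer among k samples is correct."""
    hits = 0
    for preds, ans in zip(predictions, answers):
        majority, _ = Counter(preds[:k]).most_common(1)[0]
        hits += majority == ans
    return hits / len(answers)


if __name__ == "__main__":
    # Illustrative toy data only: three questions, three sampled answers each.
    toy_predictions = [["A", "A", "B"], ["C", "B", "B"], ["D", "D", "D"]]
    toy_answers = ["A", "C", "A"]
    print(pass_at_k(toy_predictions, toy_answers, k=3))
    print(majority_accuracy(toy_predictions, toy_answers, k=3))
```

The gap between the two numbers is the headroom that a better selection rule than plain majority voting could recover, which is the motivation the Figure 1 caption gives for the self-reward score.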