When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

Xiaokun Sun, Yubo Wang, Haoyu Cao, Linli Xu

Abstract

Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even degrade performance due to "visual anchor drifting", where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. Existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, but these approaches often incur prohibitive training costs and generalize poorly across architectures. To address this, we propose FrameRepeat, an automated enhancement framework that features a lightweight repeat scoring module, enabling Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. These signals are used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.

Paper Structure (29 sections, 13 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Comparison between traditional inference and FrameRepeat inference. In traditional inference, video frames are uniformly sampled and directly fed into the MLLM. In contrast, the proposed FrameRepeat first scores the frames during inference, selecting the frames with the highest scores, which are then repeated at their original positions and passed into the MLLM for enhanced understanding.
  • Figure 2: Empirical analysis of frame repetition. (a) Attention ratio allocated to a repeated frame across generation steps. When a frame is repeated, it occupies two visual slots in the input sequence (the original slot and the copy slot), whose attention contributions are shown separately. Repetition significantly increases the model's overall attention to the target frame. (b) Per-sample best-frame (green) and worst-frame (red) $\Delta$log-prob, showing that repetition benefit is highly frame-dependent. (c) Thinking-mode accuracy when repeating Top-K, Random-K, or Bottom-K frames selected by direct-mode $\Delta$log-prob. Top-K consistently outperforms other strategies, validating the effectiveness of $\Delta$log-prob as a frame importance signal.
  • Figure 3: Overview of the FrameRepeat framework. During training, we use the Add-One-In strategy to measure the MLLM output probability for each repeated frame. The resulting probability variations are used to compute the repeat gain, which trains the repeat scoring module. During inference, the video frames and question are passed through the scoring module to estimate the repetition importance of each frame. The top-k most important frames are then selected for repetition and fed into the MLLM for the final answer.
  • Figure 4: Comparison of frames selected using FrameRepeat score versus CLIP score.
  • Figure 5: Training dynamics of the repeat scoring module. (a) The training loss decreases steadily, indicating stable convergence. (b) The score distribution standard deviation increases consistently, demonstrating that the model progressively learns to differentiate key frames from non-essential ones.
  • ...and 2 more figures
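
The captions of Figures 1 and 3 describe two steps that a small sketch may make more concrete: repeating the top-k scored frames at their original positions, and measuring a per-frame repeat gain by adding one repeated frame at a time. The Python below is an illustrative sketch only, not the authors' implementation. The function names, the `answer_logprob` callable (a stand-in for querying the MLLM), and the choice to define the gain as a difference of log-probabilities are all assumptions made for this example.

```python
from typing import Any, Callable, List, Sequence


def repeat_top_k(frames: Sequence[Any], scores: Sequence[float], k: int) -> List[Any]:
    """Duplicate the k highest-scoring frames right after their original slot.

    Temporal order is preserved; each selected frame occupies two adjacent
    slots (original and copy). Hypothetical helper, not from the paper.
    """
    if len(frames) != len(scores):
        raise ValueError("frames and scores must have the same length")
    k = max(0, min(k, len(frames)))
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])
    output: List[Any] = []
    for index, frame in enumerate(frames):
        output.append(frame)
        if index in selected:
            output.append(frame)  # copy slot, placed at the original position
    return output


def add_one_in_gains(
    frames: Sequence[Any],
    answer_logprob: Callable[[List[Any]], float],
) -> List[float]:
    """Estimate a repeat gain per frame by repeating exactly one frame at a time.

    `answer_logprob` is an assumed stand-in for the MLLM: it should return the
    log-probability of the correct answer given a frame sequence. The gain is
    assumed here to be the change in that log-probability.
    """
    baseline = answer_logprob(list(frames))
    gains: List[float] = []
    for index in range(len(frames)):
        augmented = list(frames[: index + 1]) + [frames[index]] + list(frames[index + 1:])
        gains.append(answer_logprob(augmented) - baseline)
    return gains


if __name__ == "__main__":
    toy_frames = ["a", "b", "c", "d"]
    # Toy scorer: pretend only frame "b" matters to the answer.
    toy_logprob = lambda fs: float(fs.count("b"))
    gains = add_one_in_gains(toy_frames, toy_logprob)
    print(repeat_top_k(toy_frames, gains, k=1))
```

In this sketch the gains play the role of supervision targets for a scoring module; at inference time the module's predicted scores would replace them as the input to `repeat_top_k`, so the MLLM need not be queried once per frame.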