ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro

TL;DR

ReFoCUS reframes video understanding by optimizing frame-level input selection with reinforcement learning, rather than tuning textual outputs. It uses an autoregressive frame-subset policy guided by a margin-based reward from a frozen reward model, enabling the model to internalize its own visual preferences and attend to temporally relevant cues. The approach yields consistent performance gains across diverse video QA benchmarks and demonstrates improved knowledge acquisition, while analyses confirm semantically grounded and task-adaptive frame selections. Despite computational costs and potential reward-model biases, input-level alignment offers a scalable path toward deeper, more context-aware video reasoning in LMMs.

Abstract

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet the ability to understand video content remains constrained by suboptimal frame selection strategies. Existing approaches often rely on static heuristics or external retrieval modules to feed frame information into video-LLMs, which may fail to supply query-relevant information. In this work, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework that shifts the optimization target from textual responses to visual input selection. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model's intrinsic preferences for frames that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive, conditional selection architecture that ensures temporal coherence while reducing complexity. Our approach does not require explicit supervision at the frame level and consistently improves reasoning performance across multiple video QA benchmarks, highlighting the benefits of aligning frame selection with model-internal utility.
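
As a rough illustration of the autoregressive selection idea described above, the following minimal sketch draws frames one at a time without replacement and accumulates the log-probability of the resulting subset. It is not the authors' architecture: the function name `sample_frame_subset` and the per-frame `logits` are hypothetical, and the scores are assumed to be fixed, so each pick conditions on earlier picks only through masking, whereas a learned policy would presumably recompute its scores after every selection.

```python
import math
import random


def sample_frame_subset(logits, k, rng):
    """Sample k distinct frame indices one at a time.

    Returns (sorted indices, log-probability of the sampled sequence).
    `logits` are hypothetical per-frame scores, assumed fixed across steps.
    """
    remaining = list(range(len(logits)))
    chosen, log_prob = [], 0.0
    for _ in range(k):
        # Softmax restricted to frames that have not been picked yet.
        top = max(logits[i] for i in remaining)
        weights = [math.exp(logits[i] - top) for i in remaining]
        total = sum(weights)
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        log_prob += math.log(weights[pick] / total)
        chosen.append(remaining.pop(pick))
    # Sorting restores temporal order before the frames are passed on.
    return sorted(chosen), log_prob


rng = random.Random(0)
frames, logp = sample_frame_subset([0.1, 2.0, -1.0, 0.5, 1.5, 0.0], k=3, rng=rng)
```

Sampling step by step needs only `k` softmax evaluations over at most `len(logits)` frames, instead of scoring every possible subset, which is the complexity reduction the abstract alludes to.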

Paper Structure

This paper contains 49 sections, 13 equations, 15 figures, 5 tables, 2 algorithms.

Figures (15)

  • Figure 1: Pipeline overview of ReFoCUS. The policy model $\pi_{\theta}$ samples $N$ candidate frame subsets $F$ from the input video $v$ and question $q$, and the reward model $r_{\varphi}$ evaluates each subset using its prediction confidence, producing reward signals to train $\pi_{\theta}$ via policy gradient.
  • Figure 2: Distribution of reward variance $\mathrm{Var}(m)$ (prediction margin) across 962K QA pairs. We observe that many samples yield low variance, indicating weak sensitivity to visual input. We filter out such cases ($\mathrm{Var}(m) < \tau = 0.21$) to retain a high-quality subset for policy learning.
  • Figure 3: Overview of the ReFoCUS framework. Given a video and query, $\pi_\theta$ autoregressively selects $N$ frame subsets, which are then scored by $r_\varphi$ based on their answer prediction margins. The resulting rewards guide frame-level policy optimization via reinforcement learning.
  • Figure 4: Prediction ratio relative to the baseline accuracy, across over-k% (dashed) and under-k% (solid) frame subsets from each bin.
  • Figure 5: Results on V-NIAH: (a) uniform sampling; (b) frame selection by ReFoCUS. The $x$-axis denotes the total number of video frames, and the $y$-axis indicates the relative position of the needle frame.
  • ...and 10 more figures
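
The captions for Figures 1 to 3 describe a margin-based reward, a variance filter with threshold $\tau = 0.21$, and policy-gradient training. The sketch below shows one way these pieces could fit together. Several details are assumptions rather than facts from the paper: the margin is taken as the correct option's probability minus the strongest wrong option's, the variance is the population variance over the $N$ candidate subsets, and the policy-gradient surrogate uses a simple mean baseline. All function names are hypothetical.

```python
from statistics import mean, pvariance

TAU = 0.21  # threshold quoted in the Figure 2 caption


def margin(option_probs, correct):
    """Assumed margin: correct-option probability minus the best wrong option."""
    best_wrong = max(p for i, p in enumerate(option_probs) if i != correct)
    return option_probs[correct] - best_wrong


def keep_for_training(margins, tau=TAU):
    """Keep a QA pair only if its margins vary enough across frame subsets."""
    return pvariance(margins) >= tau


def reinforce_loss(margins, log_probs):
    """REINFORCE-style surrogate loss with a mean baseline (an assumption)."""
    baseline = mean(margins)
    return -mean((m - baseline) * lp for m, lp in zip(margins, log_probs))


# One QA pair, N = 4 candidate subsets with hypothetical margins and log-probs.
margins = [0.9, -0.8, 0.7, -0.9]
log_probs = [-1.2, -2.5, -1.0, -3.0]
if keep_for_training(margins):
    loss = reinforce_loss(margins, log_probs)
```

Under these assumptions, a QA pair whose margins barely change across subsets carries almost no learning signal, since every advantage `m - baseline` is near zero; this is consistent with the motivation given for filtering low-variance samples.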