ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

Hosu Lee; Junho Kim; Hyunjun Kim; Yong Man Ro

TL;DR

ReFoCUS reframes video understanding by optimizing frame-level input selection with reinforcement learning, rather than tuning textual outputs. It uses an autoregressive frame-subset policy guided by a margin-based reward from a frozen reward model, enabling the model to internalize its own visual preferences and attend to temporally relevant cues. The approach yields consistent performance gains across diverse video QA benchmarks and demonstrates improved knowledge acquisition, while analyses confirm semantically grounded and task-adaptive frame selections. Despite computational costs and potential reward-model biases, input-level alignment offers a scalable path toward deeper, more context-aware video reasoning in LMMs.

Abstract

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet the ability to understand video content remains constrained by suboptimal frame selection strategies. Existing approaches often rely on static heuristics or external retrieval modules to feed frame information into video-LLMs, which may fail to provide query-relevant information. In this work, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), a novel frame-level policy optimization framework that shifts the optimization target from textual responses to visual input selection. ReFoCUS learns a frame selection policy via reinforcement learning, using reward signals derived from a reference LMM to reflect the model's intrinsic preferences for frames that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive, conditional selection architecture that ensures temporal coherence while reducing complexity. Our approach requires no explicit frame-level supervision and consistently improves reasoning performance across multiple video QA benchmarks, highlighting the benefits of aligning frame selection with model-internal utility.
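
As a rough illustration of the two ideas named above (autoregressive selection of a frame subset, and a margin-style reward from a frozen scorer), the following is a minimal sketch and not the authors' implementation. It assumes each frame already has a scalar policy score, so conditioning on earlier picks is reduced to masking them out, whereas a learned policy would presumably recompute scores at every step. The function names `sample_frame_subset` and `margin_reward`, and the definition of the margin as the correct answer's score minus the best incorrect score, are hypothetical choices made for this example.

```python
import math
import random


def sample_frame_subset(scores, k, rng):
    """Autoregressively sample k distinct frame indices (illustrative only).

    At each step a softmax over the not-yet-chosen frames gives the
    conditional distribution; the picked frame is removed before the next
    step. Returns (indices sorted in temporal order, log-probability of the
    sampled sequence), the quantity a policy-gradient update would weight.
    """
    remaining = list(range(len(scores)))
    chosen, log_prob = [], 0.0
    for _ in range(k):
        top = max(scores[i] for i in remaining)  # subtract max for stability
        weights = [math.exp(scores[i] - top) for i in remaining]
        total = sum(weights)
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        log_prob += math.log(weights[pick] / total)
        chosen.append(remaining.pop(pick))
    return sorted(chosen), log_prob


def margin_reward(answer_scores, correct_idx):
    """Assumed margin: correct-answer score minus the best wrong-answer score."""
    best_wrong = max(s for i, s in enumerate(answer_scores) if i != correct_idx)
    return answer_scores[correct_idx] - best_wrong


if __name__ == "__main__":
    rng = random.Random(0)
    frames, log_prob = sample_frame_subset([0.1, 2.0, -1.0, 0.5, 1.5], 3, rng)
    # In a real pipeline the answer scores would come from the frozen
    # reward model evaluated on the selected frames; these are made up.
    reward = margin_reward([0.2, 0.7, 0.1], correct_idx=1)
    print(frames, log_prob, reward)
```

Sampling without replacement one frame at a time keeps each step a choice among at most T candidates, instead of scoring every possible k-frame subset directly, which is one way to read the abstract's point about reducing complexity in a combinatorial space.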
