FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, Yujun Cai

TL;DR

FrameMind tackles the rigidity of fixed-frame inputs in video understanding by introducing FiCOT, a frame-interleaved reasoning framework. It combines Dynamic Resolution Frame Sampling (DRFS) with a group-relative policy optimization algorithm (DRFS-GRPO) to learn when to scan broadly and when to focus spatially, guided by outcome-based rewards. The approach achieves state-of-the-art performance among open-source models on MVBench, MLVU, and VideoMME, while using far fewer frames thanks to adaptive perception. This work advances practical, efficient video reasoning by treating perception as an active, tool-driven process rather than a static preprocessing step.

Abstract

Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require either broad temporal coverage or fine-grained spatial detail. In this paper, we introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps. To train effective dynamic sampling policies, we propose Dynamic Resolution Frame Sampling (DRFS), which exposes models to diverse temporal-spatial trade-offs during learning, and DRFS-GRPO, a group-relative policy optimization algorithm that learns from outcome-based rewards without requiring frame-level annotations. Extensive experiments on challenging benchmarks like MLVU and VideoMME demonstrate that our method significantly outperforms existing models, advancing the state of the art in flexible and efficient video understanding.
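
The abstract describes DRFS-GRPO as a group-relative method that learns from outcome-based rewards. As a rough illustration of the group-relative idea only, the sketch below normalizes the outcome rewards of several rollouts for the same question against the group's own mean and standard deviation, as in standard GRPO. The function name, the `eps` constant, and the example rewards are assumptions for illustration; the paper's actual DRFS-GRPO objective and reward design are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of rollouts.

    Each rollout in the group answers the same question; its advantage is
    how far its reward sits above or below the group average, in units of
    the group's standard deviation. `eps` (an assumed constant) avoids
    division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts: 1.0 = correct answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no learned value function and no frame-level labels are needed: only the final outcome reward of each rollout enters the computation.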

Paper Structure

This paper contains 43 sections, 10 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of a static, text-only CoT with our dynamic Frame-Interleaved CoT (FiCOT). (a) The conventional approach relies on a single, fixed scan of the video, resulting in insufficient spatial detail and an incorrect guess ("silver"). (b) FiCOT actively identifies its knowledge gap and uses its toolbox to retrieve a high-resolution snippet and specific frames, leading to a grounded and correct answer ("bright pink").
  • Figure 2: Overall framework of FrameMind, illustrating the iterative perception-reasoning loop. The agent first thinks, then acts (calls tools) to gather visual evidence, and updates its understanding to inform the next cycle.
  • Figure 3: Effect of the exploration bonus. (Left) reward/tool and (Right) reward/accuracy over training steps. With the exploration bonus (blue), the curves take off earlier and converge to a higher plateau on both metrics; without it (red), learning is slower and plateaus lower.
  • Figure F1: FrameMind Sample 1.
  • Figure F2: FrameMind Sample 2.
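
The think-then-act loop described in the Figure 2 caption can be sketched as a small control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Step` type, the `run_loop` function, the tool names `get_clip` and `get_frames`, the `max_turns` budget, and the toy policy are all hypothetical stand-ins for a real model and a real video toolbox.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One turn of the policy: a thought plus either a tool call or an answer."""
    thought: str
    tool: Optional[str] = None  # None means the policy is ready to answer
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


def run_loop(
    policy: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[..., str]],
    question: str,
    max_turns: int = 4,  # assumed turn budget
) -> Tuple[Optional[str], List[str]]:
    """Alternate reasoning and perception until the policy answers."""
    evidence: List[str] = []
    for _ in range(max_turns):
        step = policy(question, evidence)  # think
        if step.tool is None:
            return step.answer, evidence   # answer from gathered evidence
        observation = tools[step.tool](**step.args)  # act: call a tool
        evidence.append(observation)       # update understanding
    return None, evidence                  # budget exhausted without an answer


# Toy stand-ins (hypothetical): scan broadly first, then look closely, then answer.
def toy_policy(question: str, evidence: List[str]) -> Step:
    if len(evidence) == 0:
        return Step("Need a broad overview.", tool="get_clip",
                    args={"start": 0.0, "end": 10.0})
    if len(evidence) == 1:
        return Step("Need detail at one moment.", tool="get_frames",
                    args={"indices": [42]})
    return Step("Enough evidence.",
                answer=f"answered from {len(evidence)} observations")


toy_tools = {
    "get_clip": lambda start, end: f"clip[{start}-{end}]",
    "get_frames": lambda indices: f"frames{indices}",
}

answer, evidence = run_loop(toy_policy, toy_tools, "What color is the car?")
```

In a real system the policy would be the video-language model itself and the tools would return visual tokens rather than strings; the strings here only make the control flow easy to inspect.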