ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, Huaxiu Yao

TL;DR

The paper tackles the rigidity of static, single-pass video understanding by introducing ReAgent-V, a reward-driven, agentic framework that performs entropy-calibrated frame selection, tool-augmented reasoning, and critic-guided reflection during inference. Real-time reward signals drive iterative answer refinement from conservative, neutral, and aggressive perspectives, and also select data for SFT, DPO, and GRPO training. Across 12 datasets and three core tasks (video understanding, video LLM reasoning, and VLA alignment), the approach reports gains of up to 6.9%, 2.1%, and 9.8%, respectively, while remaining lightweight and modular. The work points to a scalable, extensible path to robust video reasoning and alignment that does not rely on costly annotations or static reward templates.

Abstract

Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism, which adjusts predictions from conservative, neutral, and aggressive viewpoints, but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications (video understanding, video reasoning enhancement, and vision-language-action model alignment) demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.
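
The reward-guided reflection described in the abstract can be illustrated with a minimal Python sketch. Everything here is hypothetical and is not the authors' implementation: `answer_fn`, `critic_fn`, and `revise_fn` stand in for the target agent, the critic agent, and the perspective-conditioned revision step, and `accept_threshold` is an invented parameter. The source does not say how the three revised answers are combined, so keeping the highest-reward candidate is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# The three reflection perspectives named in the abstract.
PERSPECTIVES = ("conservative", "neutral", "aggressive")


@dataclass
class Candidate:
    answer: str
    reward: float
    perspective: str  # "initial" or one of PERSPECTIVES


def reflect_and_refine(
    question: str,
    frames: Sequence[object],
    answer_fn: Callable[[str, Sequence[object]], str],
    critic_fn: Callable[[str, str, Sequence[object]], float],
    revise_fn: Callable[[str, str, float, str], str],
    accept_threshold: float = 0.8,  # hypothetical cut-off, not from the paper
) -> Candidate:
    """Answer once, score the answer, and revise it only if the score is low."""
    initial = answer_fn(question, frames)
    initial_reward = critic_fn(question, initial, frames)
    best = Candidate(initial, initial_reward, "initial")

    # A sufficiently rewarded answer is returned without reflection.
    if initial_reward >= accept_threshold:
        return best

    # Otherwise revise from each perspective and re-score every revision.
    for perspective in PERSPECTIVES:
        revised = revise_fn(question, initial, initial_reward, perspective)
        reward = critic_fn(question, revised, frames)
        # Assumption: keep the candidate with the strictly highest reward.
        if reward > best.reward:
            best = Candidate(revised, reward, perspective)
    return best
```

Passing the agents in as plain callables keeps the sketch independent of any particular model or tool API; in practice each callable would wrap an LVLM call.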

Paper Structure

This paper contains 37 sections, 5 equations, 24 figures, 5 tables, 1 algorithm.

Figures (24)

  • Figure 1: Overview of the ReAgent-V framework: The system first selects relevant video frames based on the input question through the entropy-calibrated frame selection module and invokes various tools from the tool factory to assist in reasoning. The target agent generates an initial answer using the selected tools and input context, which is then critically evaluated by the critic agent through questioning and scoring, ultimately producing a comprehensive feedback report. The target agent can then revise its answer from three perspectives—conservative, neutral, and aggressive—based on the report and updated context. In addition to generating standard reasoning outputs, high-scoring data identified during the reflection process based on the feedback report is stored for training algorithms such as SFT, DPO, and GRPO, thereby further enhancing model performance.
  • Figure 2: Frame selection analysis (VideoMME VideoID: 24i4ncHuf6A, QuestionID:005-2) shows entropy, CLIP score, and ECRS across frames; red lines highlight the most relevant frames.
  • Figure 3: Comparison of ReAgent-V with OpenVLA and other reward methods, trained on the same data, in the Simpler-Env environment.
  • Figure 4: Comparison of four reflection strategies ($t_a$, $t_n$, $t_c$, and ReAgent-V), where Corr. Rate denotes the frequency of answer revision and Corr. Acc indicates the accuracy of those revisions.
  • Figure 5: A case study demonstrating how ReAgent-V enhances video understanding through iterative reasoning and tool use (see additional examples in the appendix).
  • ...and 19 more figures
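
The two quantities named in the Figure 4 caption can be made concrete with a small Python sketch. It assumes that Corr. Rate is the fraction of samples whose answer changed after reflection and that Corr. Acc is the fraction of those changed answers that match the ground truth; the paper's exact definitions may differ. The function name `correction_stats` and the choice to return 0.0 for Corr. Acc when nothing was revised are illustrative assumptions.

```python
from typing import Sequence, Tuple


def correction_stats(
    initial: Sequence[str],
    final: Sequence[str],
    truth: Sequence[str],
) -> Tuple[float, float]:
    """Return (correction rate, correction accuracy) under the assumed definitions."""
    if not (len(initial) == len(final) == len(truth)):
        raise ValueError("all three sequences must have the same length")
    if not initial:
        raise ValueError("at least one sample is required")

    # Keep only samples whose answer was changed by reflection.
    revised = [(f, t) for i, f, t in zip(initial, final, truth) if i != f]

    corr_rate = len(revised) / len(initial)
    # Assumption: report 0.0 accuracy when no answer was revised.
    corr_acc = sum(f == t for f, t in revised) / len(revised) if revised else 0.0
    return corr_rate, corr_acc


if __name__ == "__main__":
    rate, acc = correction_stats(
        initial=["A", "B", "C", "D"],
        final=["A", "C", "C", "A"],
        truth=["A", "C", "D", "B"],
    )
    print(f"Corr. Rate = {rate:.2f}, Corr. Acc = {acc:.2f}")
```

In the usage example, two of four answers are revised and one of the two revisions is correct, so both values come out to 0.50.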