Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang

TL;DR

The paper tackles long video reasoning by moving beyond text-only chain-of-thought to multimodal chain-of-thought, supported by a visual toolbox that samples frames on demand. It introduces VITAL, an end-to-end agentic framework in which a multimodal LLM calls the visual toolbox during reasoning and is trained with multi-task objectives and the Difficulty-aware GRPO (DGRPO) algorithm. Two large-scale datasets, MTVR-CoT-72k and MTVR-RL-110k, support supervised fine-tuning and reinforcement learning, respectively, for temporal grounding and VQA tasks, with evidence that tool-augmented multimodal reasoning reduces hallucination and improves long-video performance. Results across 11 benchmarks show state-of-the-art performance in long-video QA and grounding, underscoring the value of on-demand visual evidence and difficulty-aware RL for robust video understanding.

Abstract

The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets: MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, which outperforms existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. Code is available at https://zhang9302002.github.io/thinkingwithvideos-page/.
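
To make the idea of sampling frames "on demand" concrete, here is a minimal sketch of how a tool call might turn a requested time window into frame timestamps. It is an illustration only, not the paper's toolbox: the function name `dense_sample_timestamps`, the `fps` and `max_frames` parameters, and the midpoint sampling rule are all assumptions introduced here.

```python
from typing import List


def dense_sample_timestamps(
    start_s: float,
    end_s: float,
    fps: float = 2.0,       # hypothetical sampling rate for the zoomed-in clip
    max_frames: int = 16,   # hypothetical cap so the context stays bounded
) -> List[float]:
    """Return timestamps (in seconds) for densely sampling the clip [start_s, end_s].

    Hypothetical helper: the window is split into n equal segments and one
    frame is taken at the midpoint of each segment.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    if fps <= 0 or max_frames < 1:
        raise ValueError("fps must be positive and max_frames at least 1")

    duration = end_s - start_s
    # Number of frames at the requested rate, at least 1 and at most max_frames.
    n = min(max_frames, max(1, int(duration * fps)))
    step = duration / n
    return [round(start_s + (i + 0.5) * step, 3) for i in range(n)]


if __name__ == "__main__":
    # A model that suspects the answer lies between 10 s and 14 s could
    # request a denser look at just that window.
    print(dense_sample_timestamps(10.0, 14.0, fps=1.0))
```

In an agentic loop, the frames at these timestamps would be decoded and appended to the reasoning context before the model continues; how VITAL actually does this is not specified in the abstract.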

Paper Structure

This paper contains 38 sections, 5 equations, 17 figures, 14 tables, 1 algorithm.

Figures (17)

  • Figure 1: Performance on the long video temporal grounding benchmark VUE-TR. VITAL-7B (w/o) denotes VITAL-7B without the toolbox. VITAL-7B achieves state-of-the-art performance.
  • Figure 2: Comparison between text-based CoT (left) and multimodal CoT (right) on temporal grounding task. Green text denotes correct inference and orange text denotes wrong inference. "Thinking with tools" reduces hallucination in the reasoning process by integrating relevant, densely sampled video clip frames into multimodal CoT, resulting in more accurate grounding.
  • Figure 3: Overview of the Video Intelligence via Tool-Augmented Learning (VITAL) framework. In the multi-round generation process, the model can attend to video tools adaptively and integrate the tool results to form a multimodal CoT. The model is optimized with Difficulty-aware Group Relative Policy Optimization (DGRPO).
  • Figure 4: Data generation pipeline of MTVR training dataset. A rollout filter is applied to improve data quality.
  • Figure 5: Task distribution of MTVR training dataset.
  • ...and 12 more figures
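
The Figure 3 caption names DGRPO, which builds on Group Relative Policy Optimization (GRPO). As background, the sketch below shows the standard GRPO group-relative advantage: rewards for a group of rollouts from the same prompt are normalized by the group mean and standard deviation. The difficulty-aware adjustment that DGRPO adds is not described on this page, so it is deliberately not reproduced here; the function name and the `eps` parameter are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style advantages for one group of rollouts.

    Each reward is centered by the group mean and scaled by the group
    standard deviation. Population std is used here; some implementations
    use the sample std instead. This is plain GRPO, not DGRPO.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one question: two correct (reward 1), two wrong (reward 0).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Note that when all rollouts in a group receive the same reward (a question that is uniformly too easy or too hard), every advantage is zero and the group contributes no learning signal, which is one reason difficulty imbalance matters in multi-task reinforcement learning.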