Reinforcing Video Reasoning with Focused Thinking

Jisheng Dang, Jingze Wu, Teng Wang, Xuanhui Lin, Nannan Zhu, Hongbo Chen, Wei-Shi Zheng, Meng Wang, Tat-Seng Chua

TL;DR

This work tackles two core drawbacks in reinforcement learning for multimodal LLMs applied to video reasoning: verbose, unfocused reasoning and sparse binary rewards. It introduces TW-GRPO, combining token-level importance weighting based on distributional divergence with multi-level soft rewards, and adds a data-augmentation method (Question-Answer Inversion) to enable multi-choice QA. Empirical results across six video benchmarks show state-of-the-art performance on CLEVRER, NExT-GQA, and MMVU, along with faster convergence and more concise reasoning. The approach provides a practical pathway toward more efficient, focused video reasoning in multimodal LLMs and highlights the value of token-level signals and soft, graded supervision in RL settings.

Abstract

Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) these models often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group information entropy), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4% accuracy on CLEVRER (18.8% improvement over Video-R1) and 65.8% on MMVU. Our code is available at https://github.com/longmalongma/TW-GRPO.
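To make the contrast between binary and soft rewards concrete, the following is a minimal sketch of one possible graded reward for multi-choice QA. The abstract does not give the exact reward formula, so the rule used here (partial credit only when no wrong option is selected) and the function names `soft_reward` and `binary_reward` are illustrative assumptions, not the authors' implementation.

```python
def binary_reward(predicted: set, correct: set) -> float:
    """All-or-nothing reward: 1.0 only for an exact match of the option sets."""
    return 1.0 if predicted == correct else 0.0


def soft_reward(predicted: set, correct: set) -> float:
    """Graded reward for a multi-choice answer (assumed rule, for illustration).

    Gives the fraction of correct options recovered when the prediction
    contains no wrong option, and 0.0 otherwise.
    """
    if not predicted or not correct:
        return 0.0
    if not predicted <= correct:  # any wrong option forfeits the reward
        return 0.0
    return len(predicted) / len(correct)


# A partially correct answer: two of three correct options, none wrong.
answer, truth = {"A", "B"}, {"A", "B", "C"}
print(binary_reward(answer, truth), soft_reward(answer, truth))
```

Under this assumed rule, the partially correct answer earns nothing from the binary reward but a fractional score from the soft reward, which is the kind of graded signal the abstract argues reduces reward variance.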

Paper Structure

This paper contains 40 sections, 29 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: TW-GRPO integrates focused thinking and soft multi-level rewards for multi-choice QA. Unlike vanilla thinking, which assigns uniform token importance, focused thinking highlights critical tokens so that they dominate the loss calculation. By shifting from single-choice QA’s binary rewards to multi-choice QA’s multi-level rewards, TW-GRPO enables fine-grained gradient estimation and improves training efficiency.
  • Figure 2: Overview of the TW-GRPO framework. The diagram shows the key steps in a forward pass, starting from the video input, generating possible completions, and calculating the reward with adjustments for the final objective and model updates. Specifically, a multi-level soft reward is incorporated into the reward calculation, providing partial correctness feedback. These signals are then integrated into the final objective, where token-level importance weighting is applied, allowing the model to prioritize more informative tokens and improve overall performance.
  • Figure 3: Training dynamics of different GRPO variants. (a) TW-GRPO achieves faster convergence in reward standard deviation, indicating more stable and efficient learning. (b) It also produces consistently shorter output lengths, reflecting more concise and effective reasoning than other methods.
  • Figure 4: Comparison of reasoning paths from T-GRPO and TW-GRPO on MMVU samples.
  • Figure A1: Analysis of the influence of the TW-GRPO weighting coefficient $\alpha$ on model performance.
  • ...and 4 more figures
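
The token-level importance weighting described in the captions of Figures 1 and 2 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's formulation: it scores each token position by how much the per-position token distributions of the completions in a group diverge from their group mean (using KL divergence), min-max normalizes the scores, and scales them with a coefficient `alpha`. The function names, the normalization, and the way `alpha` enters are all hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def token_weights(group_dists, alpha=1.0):
    """Per-position weights from intra-group divergence (assumed scheme).

    group_dists[g][t] is completion g's distribution over the vocabulary at
    position t. Positions where the group disagrees get larger weights.
    """
    group_size = len(group_dists)
    n_pos = len(group_dists[0])
    divergence = []
    for t in range(n_pos):
        dists = [group_dists[g][t] for g in range(group_size)]
        mean = [sum(col) / group_size for col in zip(*dists)]
        divergence.append(sum(kl(d, mean) for d in dists) / group_size)
    lo, hi = min(divergence), max(divergence)
    if hi - lo < 1e-12:  # no position stands out: fall back to uniform
        return [1.0] * n_pos
    return [1.0 + alpha * (d - lo) / (hi - lo) for d in divergence]


def weighted_token_loss(token_losses, weights):
    """Weighted average of per-token losses."""
    return sum(w * l for w, l in zip(weights, token_losses)) / sum(weights)


# Two completions, two positions, vocabulary of size 2.
# Position 0: the group agrees. Position 1: the group disagrees.
group = [
    [[0.5, 0.5], [0.9, 0.1]],
    [[0.5, 0.5], [0.1, 0.9]],
]
weights = token_weights(group, alpha=1.0)
print(weights, weighted_token_loss([1.0, 4.0], weights))
```

In this toy group, the position where the completions disagree receives the larger weight, so its loss term contributes more to the objective than the position where every completion predicts the same distribution.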