VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu

TL;DR

VR-Thinker addresses the limits of visual reward models by combining thinking-with-image reasoning with a configurable visual memory window. It introduces a three-stage training pipeline (Cold Start, Rejection sampling Fine-Tuning, and GRPO) to cultivate multimodal reasoning that revisits visual evidence. On video preference benchmarks it reports state-of-the-art accuracy among open-source models, with notable gains on long videos attributed to active frame retrieval and memory management. The work advances reward modeling for video generation by integrating visual reasoning into the feedback loop, with promising implications for alignment and reliability.

Abstract

Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
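
Stages (ii) and (iii) of the pipeline can be illustrated with a minimal sketch. The `Trace` structure and both function names are hypothetical and are not taken from the paper. The filter mirrors the stated selection rule (every per-dimension judgment and the overall judgment must be correct). The advantage function follows the standard GRPO formulation, in which rewards are normalized within a group of samples for the same prompt; the paper's actual reward design and any modifications to it are not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Trace:
    """One sampled reasoning trace (hypothetical structure, for illustration only)."""
    per_dimension_correct: list  # one bool per evaluated dimension
    overall_correct: bool        # whether the overall preference judgment is correct


def keep_for_rejection_sampling(trace: Trace) -> bool:
    """Keep a trace only if all per-dimension judgments and the overall one are correct."""
    return trace.overall_correct and all(trace.per_dimension_correct)


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Standard GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: filter sampled traces, then normalize a group of rewards.
traces = [
    Trace([True, True, True], True),    # kept
    Trace([True, False, True], True),   # dropped: one dimension is wrong
    Trace([True, True, True], False),   # dropped: overall judgment is wrong
]
kept = [t for t in traces if keep_for_rejection_sampling(t)]
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group receives the same reward, the normalized advantages are all zero, so such a group contributes no learning signal under this formulation.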

Paper Structure

This paper contains 42 sections, 18 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (a) shows the main process of the proposed Thinking-with-Image framework. (b) shows an overview of the three proposed training stages: Cold Start, Rejection sampling Fine-Tuning, and GRPO.
  • Figure 2: Qualitative Cases. When frames are downsampled, key information might not be included in the input. VR-Thinker actively retrieves frames, which ensures the correctness of such cases.
  • Figure 3: Ablation studies: (1) ablation of visual reasoning; (2) impact of different training stages on final model performance; (3) ablations of different auxiliary reward settings; and (4) ablation of different accuracy reward signals, obtained by modifying the accuracy reward.
  • Figure 4: The training dynamics of the GRPO stage: (1) accuracy on GenAI-Bench throughout training; (2) average tool invocations per sample; (3) average reasoning segment length.
  • Figure 5: Hyperparameter search and rejection sampling fine-tuning data volume comparison: (a) parameter search for $\alpha$; (b) parameter search for $k$; (c) comparison across rejection sampling fine-tuning data volumes.
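
The frame retrieval behavior described for Figure 2, together with the configurable visual memory window from the abstract, can be sketched as a small bounded buffer. This is an illustrative sketch only: the class name, the `select_frames` method, and the oldest-first eviction policy are all assumptions, since this summary does not specify how the window decides which frames to drop.

```python
from collections import OrderedDict


class VisualMemoryWindow:
    """Holds at most `capacity` frames in context (eviction policy is an assumption)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._frames = OrderedDict()  # frame index -> frame data, oldest first

    def select_frames(self, video: list, indices: list) -> list:
        """Add the requested frames and return the indices currently in the window."""
        for i in indices:
            if not 0 <= i < len(video):
                continue  # ignore requests outside the video
            self._frames.pop(i, None)  # re-selecting a frame refreshes its position
            self._frames[i] = video[i]
            while len(self._frames) > self.capacity:
                self._frames.popitem(last=False)  # evict the oldest frame
        return list(self._frames)


# Usage: an 8-frame video with room for 3 frames in context.
video = list("abcdefgh")
window = VisualMemoryWindow(capacity=3)
first = window.select_frames(video, [0, 4])
second = window.select_frames(video, [6, 7])
third = window.select_frames(video, [4])
```

The capacity parameter stands in for the "configurable" aspect: a larger window keeps more visual evidence available at the cost of context budget.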