VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization

Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang

TL;DR

VerIPO tackles the challenge of cultivating long-term reasoning in Video-LLMs by introducing a verifier-guided, iterative policy loop that alternates between GRPO-based exploration and DPO-based refinement. A Rollout-Aware Verifier judges CoT quality and consistency, generating high-quality contrastive samples (including reflective paths) to train via DPO, achieving faster and more stable improvements than GRPO alone. The approach yields longer, more contextually coherent CoTs and surpasses instruction-tuned Video-LLMs and several strong RL baselines on multiple video-reasoning benchmarks. The results demonstrate robust long-form reasoning capabilities with improved accuracy and reduced inconsistencies, suggesting practical impact for complex video reasoning under efficient training costs.

Abstract

Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chains-of-thought (CoTs) and in downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier uses small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
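
To make the verifier step of the GRPO-Verifier-DPO loop more concrete, the following is a minimal Python sketch of how rollouts might be sorted into chosen/rejected preference pairs. It is an illustration under stated assumptions, not the paper's implementation: the `Rollout` structure, the `build_preference_pairs` function, and the `thinking_answer` callable (a stand-in for the small-LLM judge) are all hypothetical names, and the pairing rule (correct and self-consistent rollouts are chosen, everything else is rejected) is a simplification of what the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rollout:
    """One sampled response for a question (hypothetical structure)."""
    thinking: str  # chain-of-thought text
    answer: str    # final answer extracted from the response


def build_preference_pairs(
    rollouts: List[Rollout],
    gold: str,
    thinking_answer: Callable[[str], Optional[str]],
) -> List[Tuple[Rollout, Rollout]]:
    """Return (chosen, rejected) pairs for DPO-style training.

    `thinking_answer` stands in for the small-LLM judge (an assumption):
    it returns the answer that the reasoning text itself supports,
    or None if it cannot tell.
    """
    chosen: List[Rollout] = []
    rejected: List[Rollout] = []
    for r in rollouts:
        correct = r.answer == gold
        consistent = thinking_answer(r.thinking) == r.answer
        if correct and consistent:
            chosen.append(r)
        else:
            # Wrong answer, or the reasoning contradicts the final answer.
            rejected.append(r)
    # Pair rollouts one-to-one; unmatched leftovers are dropped.
    return list(zip(chosen, rejected))
```

In this sketch a rollout whose reasoning contradicts its own final answer is rejected even when the final answer is correct, which is one plausible way to penalize the contextual inconsistency the abstract mentions.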

Paper Structure

This paper contains 25 sections, 4 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Figures (A, D): Initial GRPO training with different data types shows that using only Video-QA data decreases response length. Figures (B, E): Continual GRPO training with and without Verifier-guided DPO (VerIPO) demonstrates that VerIPO improves accuracy and response length. Figure (C): Inconsistency rate (thinking vs. final answer) at different stages reveals that our method lowers the contextual inconsistency of long CoTs, while GRPO increases it. Figure (F): Performance on the challenging video reasoning dataset VSI-Bench [yang2024think] shows that VerIPO (trained with Qwen2.5-VL-7B) outperforms strong LMMs including GPT-4o [hurst2024gpt], Video-R1 [feng2025videor1reinforcingvideoreasoning], and Kimi-VL [kimiteam2025kimivltechnicalreport].
  • Figure 2: Overview of VerIPO workflow. This training loop is guided by the Verifier's continuous evaluation and selection of training samples. The optimization process progressively improves the model's long reasoning capability by learning from high-quality and informative reasoning examples.
  • Figure 3: Figure (A): Performance comparison after removing Reflective Preference Pairs and Inference Consistency Pairs during the DPO (I-2) stage. The reported values are averages across MMVU (mc) and TOMATO. For visualization, the response length has been scaled to 0.25 of its original value. Figure (B): Inconsistency rate (thinking vs. final answer) at Cold Start and at different stages of VerIPO. The reported values are average scores across MMVU (mc) and TOMATO. The inconsistency-rate statistics are given in \ref{inconsistency_rate}. Figure (C): The number of repeated responses generated by VerIPO at different training stages over the evaluation datasets. The reported values are the sum over VSI-Bench, Video-MMMU, MMVU (mc) and TOMATO.
  • Figure 4: A case from Video-MMMU shows the comparative performance of GRPO and VerIPO. Our method can generate longer CoTs with accurate and logical formulas to solve physical problems.
  • Figure 5: A case from VSI-Bench shows the comparative performance of GRPO and VerIPO. Our method is capable of generating longer responses and employing self-validation to address spatial reasoning tasks.
  • ...and 4 more figures
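
The captions for Figures 1 (C) and 3 (B) refer to an inconsistency rate between the thinking and the final answer. A minimal Python sketch of one way such a rate could be computed is given below. The function name `inconsistency_rate` and the choice to skip responses whose reasoning yields no determinable answer are assumptions made for illustration; the paper's exact procedure is not stated in this section.

```python
from typing import List, Optional, Tuple


def inconsistency_rate(pairs: List[Tuple[Optional[str], str]]) -> float:
    """Fraction of responses whose reasoning supports a different answer
    than the one finally given.

    Each item is (answer implied by the thinking, final answer).
    Items whose thinking answer is None (undetermined) are skipped,
    which is an assumption of this sketch.
    """
    judged = [(t, f) for t, f in pairs if t is not None]
    if not judged:
        return 0.0
    mismatches = sum(1 for t, f in judged if t != f)
    return mismatches / len(judged)
```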