VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong

TL;DR

VL-Cogito introduces PCuRL, a Progressive Curriculum Reinforcement Learning framework that stabilizes and enhances multimodal reasoning across diverse domains. By integrating Online Difficulty Soft Weighting and Dynamic Length Reward, the method guides learning from easy to hard tasks and adapts reasoning length to problem complexity, without requiring cold-start SFT. Extensive experiments across mathematics, science, logic, and general vision benchmarks show state-of-the-art or highly competitive performance, and ablation studies confirm the value of each component. The results demonstrate improved reasoning depth, efficiency, and training stability, with practical impact for reliable multimodal reasoning systems.

Abstract

Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
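To make the online difficulty soft weighting idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes that a prompt's difficulty is estimated online as the fraction of correct rollouts in its sampled group, and that each curriculum stage weights prompts with a smooth bump centered on a stage-specific accuracy. The names `STAGE_CENTERS`, `stage_weight`, and `weighted_advantages`, the Gaussian shape, the centers 0.75/0.5/0.25, and the use of group-normalized advantages are all hypothetical choices made for illustration.

```python
import math

# Hypothetical stage centers: the rollout accuracy each stage emphasizes.
# The easy stage favors prompts the model mostly solves; the hard stage
# favors prompts the model mostly fails.
STAGE_CENTERS = {"easy": 0.75, "medium": 0.5, "hard": 0.25}


def rollout_accuracy(rewards):
    """Fraction of correct rollouts for one prompt (rewards are 0 or 1)."""
    return sum(rewards) / len(rewards)


def stage_weight(accuracy, stage, width=0.25):
    """Soft weight in (0, 1]: a Gaussian bump around the stage center.

    The Gaussian form and the width are assumptions, not taken from the paper.
    """
    center = STAGE_CENTERS[stage]
    return math.exp(-((accuracy - center) ** 2) / (2 * width ** 2))


def weighted_advantages(rewards, stage):
    """Group-normalized advantages scaled by the prompt's soft weight."""
    mean = rollout_accuracy(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        # All rollouts agree, so the group carries no learning signal.
        return [0.0] * len(rewards)
    weight = stage_weight(mean, stage)
    return [weight * (r - mean) / std for r in rewards]
```

Under these assumptions, a prompt solved in 3 of 4 rollouts receives full weight in the easy stage and a much smaller weight in the hard stage, so the same data is softly re-emphasized rather than filtered out as training progresses.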

Paper Structure

This paper contains 25 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An overview of the proposed Progressive Curriculum Reinforcement Learning (PCuRL) framework. It consists of two key components: (1) a multi-stage curriculum RL structure that utilizes online difficulty soft weighting, which partitions the training progress into different stages based on task difficulty; (2) a dynamic length reward mechanism that encourages the model to adapt its reasoning chain length according to task complexity, rather than indiscriminately increasing it. In the Easy stage, the model tends to assign higher weights to relatively easier questions for policy optimization, a pattern that similarly applies to the Medium and Hard stages.
  • Figure 2: Three difficulty distributions, i.e., easy, medium, and hard, for the Online Difficulty Soft Weighting (ODSW).
  • Figure 3: Performance comparison of models trained with different length reward strategies. "Dynamic-$N$" denotes models employing our dynamic length reward with a target length of $N$ during the final stage of curriculum RL. "Fix-$N$" refers to models trained with a fixed-length reward that enforces the fixed target length of $N$ across all responses. We visualize both the average response length and the overall accuracy across selected benchmarks.
  • Figure 4: Training curves for PCuRL (with a target response length of $500$ tokens) and vanilla GRPO. The average reward curve indicates the mean reward of sampled responses during training. The validation accuracy curve shows model performance on a held-out validation set (around $1{,}000$ questions, split from the original training set at initialization) as measured by the accuracy reward function. The average length curve displays the mean response length of sampled outputs during training.
  • Figure 5: Case studies of VL-Cogito, where samples are drawn from multiple benchmarks, including MMStar, ScienceQA, Geometry@3K, and MathVision.
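The contrast between the "Fix-$N$" and "Dynamic-$N$" settings in the Figure 3 caption can be illustrated with a small sketch. It is written under stated assumptions and is not the paper's reward. It assumes a cosine-shaped length reward that rises to 1 at the target length and stays flat beyond it, and it assumes the dynamic target for a prompt is the mean length of that prompt's correct rollouts, falling back to $N$ when none are correct. The names `length_reward`, `fixed_target`, and `dynamic_target` are hypothetical.

```python
import math


def length_reward(length, target):
    """Assumed shape: 0 at length 0, cosine ramp up to 1 at the target,
    then flat, so exceeding the target earns no extra reward."""
    if target <= 0:
        raise ValueError("target must be positive")
    ratio = min(length / target, 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * ratio))


def fixed_target(lengths, correct, n):
    """Fix-N style: every prompt shares the same target length n."""
    return n


def dynamic_target(lengths, correct, n):
    """Dynamic-N style (assumed rule): mean length of the correct rollouts
    for this prompt, or n if no rollout was correct."""
    good = [length for length, ok in zip(lengths, correct) if ok]
    if not good:
        return n
    return sum(good) / len(good)


# Example: an easy prompt whose correct answers are short.
lengths = [100, 300, 800]
correct = [True, True, False]
target = dynamic_target(lengths, correct, n=500)
rewards = [length_reward(length, target) for length in lengths]
```

With a fixed target, the short correct answers above would be rewarded less than the long incorrect one on the length term. With the per-prompt target in this sketch, the target drops to match what correct answers for that prompt actually needed.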