MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu

TL;DR

This work tackles gradient-vanishing in reinforcement-learning fine-tuning for multimodal reasoning by introducing Variance-Aware Sampling (VAS), which promotes reward variance through the Variance Promotion Score (VPS) built from Outcome Variance (OVS) and Trajectory Diversity (TDS). The authors provide theoretical guarantees linking reward variance to expected policy-gradient improvements and demonstrate stability and performance gains on math and logic benchmarks. They also release large-scale, carefully curated cold-start CoT data (~1.6M) and RL data (~15k), plus open-source models and a reproducible end-to-end training codebase, establishing standardized baselines for the community. Empirical results show that VAS improves convergence, stability, and downstream reasoning capabilities, with ablations clarifying the complementary roles of OVS and TDS. Overall, the work contributes practical data resources and a principled sampling strategy to advance stable, variance-aware reinforcement learning for multimodal reasoning.
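As a rough illustration of how a per-prompt score could combine the two components named above, the sketch below computes a VPS-like value from a group of rollouts. The summary does not give the formulas, so everything here is an assumption: OVS is taken as the Bernoulli variance p(1 - p) of binary correctness rewards, TDS is approximated by the mean pairwise Jaccard distance between rollout token sets, and the weights `alpha` and `beta` are hypothetical.

```python
from itertools import combinations


def outcome_variance(rewards):
    """Assumed OVS: Bernoulli variance p(1 - p) of binary correctness rewards."""
    p = sum(rewards) / len(rewards)
    return p * (1.0 - p)


def trajectory_diversity(rollouts):
    """Assumed stand-in for TDS: mean pairwise Jaccard distance of token sets."""
    token_sets = [set(text.split()) for text in rollouts]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0  # a single rollout has no pairwise diversity
    distances = []
    for a, b in pairs:
        union = a | b
        distances.append(1.0 - len(a & b) / len(union) if union else 0.0)
    return sum(distances) / len(distances)


def variance_promotion_score(rewards, rollouts, alpha=0.8, beta=0.2):
    """Hypothetical weighted sum of the two components (weights are illustrative)."""
    return alpha * outcome_variance(rewards) + beta * trajectory_diversity(rollouts)


if __name__ == "__main__":
    # A prompt the policy always solves the same way scores zero under this sketch.
    print(variance_promotion_score([1, 1, 1, 1], ["x = 2"] * 4))
    # A prompt with mixed outcomes and varied reasoning scores higher.
    print(variance_promotion_score([1, 0, 1, 0], ["x = 2", "y = 3", "x = 2", "z = 5"]))
```

Under these assumptions, prompts that are always solved or always failed receive zero outcome variance, so any remaining score comes only from the diversity term.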

Abstract

Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
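The gradient-vanishing issue described in the abstract can be seen in a minimal sketch of group-relative advantage normalization, the step that gives GRPO its name. The function name, the `eps` constant, and the use of the population standard deviation are choices made here for illustration, not details taken from the paper or its codebase.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and std of its group.

    eps is an illustrative stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, used here for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # All rollouts receive the same reward: every advantage is zero,
    # so this prompt contributes no policy-gradient signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
    # Mixed outcomes: advantages are nonzero and carry a learning signal.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When every rollout in a group gets the same reward, each advantage is exactly zero regardless of `eps`, which is why prompts with low reward variance weaken the optimization signal.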

Paper Structure

This paper contains 58 sections, 4 theorems, 24 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let the smoothness assumption hold and use the optimal baseline $b^\star(x)$. Then for any step size $0 < \eta \le \tfrac{c_{\min}}{4L\Gamma_\theta(x)}$, the expected one-step gain satisfies

Figures (4)

  • Figure 1: Overview of the Variance-Aware Sampling (VAS) framework.
  • Figure 2: Training efficiency of Variance-Aware Sampling (VAS). The plots compare three settings: full VAS sampling ($\lambda=1.0$, orange), mixed sampling with half VAS and half random ($\lambda=0.5$, blue), and the vanilla baseline (purple). Left: Actor gradient norm, reflecting the magnitude of gradient signals during training. Middle: Policy gradient clip fraction, indicating the proportion of updates reaching the clipping boundary. Right: Validation accuracy, showing convergence speed and final performance.
  • Figure 3: Dynamics of Variance Promotion Score (VPS) during training. The top row illustrates the distribution of VPS values across data points at different training steps. The bottom row shows transition matrices of VPS assignments between consecutive update intervals, where each cell indicates the number of data points moving from a source bin (vertical axis) at the earlier step to a target bin (horizontal axis) at the later step. The arrows indicate the direction from lower to higher VPS bins, facilitating interpretation of upward or downward transitions.
  • Figure 4: Qualitative demonstration of MMR1’s reasoning process on a MathVerse problem. The figure illustrates the input question, the model’s step-by-step thinking process, and the final answer. The reasoning is logically structured, including problem analysis, solution planning, execution, verification, and alternative approaches, ultimately arriving at the correct answer ($140^\circ$).
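The mixing ratio $\lambda$ in the Figure 2 caption can be read as the fraction of each batch drawn by VAS, with the remainder drawn uniformly at random. The sketch below follows that reading. Drawing the VAS portion with probability proportional to VPS is an assumption, as are the function name, the sampling with replacement, and the fallback to uniform sampling when all scores are zero.

```python
import random


def sample_mixed_batch(vps, batch_size, lam, rng):
    """Return prompt indices: a lam fraction weighted by VPS, the rest uniform.

    vps        -- list of non-negative scores, one per prompt
    batch_size -- total number of indices to draw (with replacement)
    lam        -- fraction of the batch drawn by VPS weighting (0.0 to 1.0)
    rng        -- a random.Random instance, passed in for reproducibility
    """
    n_vas = round(lam * batch_size)
    indices = list(range(len(vps)))
    weights = [max(score, 0.0) for score in vps]
    if sum(weights) > 0:
        vas_part = rng.choices(indices, weights=weights, k=n_vas)
    else:
        # No informative scores yet: fall back to uniform sampling.
        vas_part = rng.choices(indices, k=n_vas)
    uniform_part = rng.choices(indices, k=batch_size - n_vas)
    return vas_part + uniform_part


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [0.0, 0.05, 0.25, 0.10]
    print(sample_mixed_batch(scores, batch_size=8, lam=0.5, rng=rng))
```

With `lam=1.0` the whole batch is score-weighted, and with `lam=0.0` the function reduces to uniform sampling, which corresponds to the vanilla baseline under this reading.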

Theorems & Definitions (7)

  • Theorem 1: Variance–Progress
  • Proposition 1: Optimal action-independent baseline
  • proof
  • Lemma 1: Variance sandwich bound
  • proof
  • Proposition 2: Uniform Fisher bounds
  • proof