DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

Haowen Gao; Zhenyu Zhang; Liang Pang; Fangda Guo; Hongjian Dou; Guannan Lv; Shaoguo Liu; Tingting Gao; Huawei Shen; Xueqi Cheng

TL;DR

DIVA-GRPO is a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective, alleviating reward sparsity and advantage vanishing while improving training stability.

Abstract

Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
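The advantage-vanishing problem mentioned in the abstract can be illustrated with a minimal sketch of GRPO-style group-relative advantages. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the DIVA-GRPO code base, and implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group.

    Assumption: population std and an additive eps in the denominator;
    real GRPO implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes within a group give a clear, signed learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# On an overly hard (all wrong) or overly easy (all right) problem the
# rewards are identical, so every advantage collapses to zero.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_wrong)
print(all_right)
```

When every rollout in a group receives the same reward, the numerator is zero for each rollout, so the group contributes no gradient signal. This is the situation that the method aims to avoid by keeping within-group reward distributions varied.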

Paper Structure (103 sections, 7 theorems, 30 equations, 36 figures, 7 tables)

Introduction
Challenge: Reward Sparsity and Advantage Vanishing
Group Relative Policy Optimization (GRPO).
Reward Sparsity and Advantage Vanishing.
Motivation.
Difficulty-Adaptive Variant Advantage with GRPO
Difficulty Assessment of Samples
Difficulty-Adaptive Variant Generation
Difficulty-Weighted and Normalized Advantage Balancing
(1) Local–Global imbalance.
(2) Difficulty-weighted scaling.
Reward-Range-Based Advantage Rescaling (RRB-Rescaling)
Experiments
Experimental Setup
Main Results (RQ1: Effectiveness)
...and 88 more sections

Key Result

Theorem B.1

Let $g(\theta)$ be the stochastic policy gradient estimator, where $A_t$ denotes the advantage function. Suppose $g(\theta)$ is unbiased, i.e. $\mathbb{E}[g(\theta)] = \nabla J(\theta)$, and $\text{Var}[g(\theta)] < \infty$. Then for step size $\eta > 0$, the expected squared error satisfies [...]. This inequality shows that the convergence rate depends critically on the variance of the gradient estimator. Reducing gradie

Figures (36)

Figure 1: (a) Selective sample utilization relies on only a subset of the data, leading to underuse. (b) Sample enhancement expands the data without difficulty awareness, causing even more severe advantage sparsity. (c) Our method adaptively expands the sample space according to problem difficulty, ensuring a stable difficulty distribution.
Figure 2: Overview of the proposed DIVA-GRPO method. For a given question, we dynamically assess its difficulty based on past rollout rewards and adaptively sample variants of different difficulty levels. As shown, when the original question is hard, easier variants are sampled to ensure reward diversity. We then compute local (the question itself) and global (the question with its variants) advantages, and obtain the final advantage through difficulty-aware reweighting and reward-range rescaling to update the policy model.
Figure 3: RQ1: Effectiveness of DIVA-GRPO
Figure 4: RQ3: Effectiveness of RRB on General GRPO Methods
Figure 5: RQ4: Impact of DIVA-GRPO on Efficiency and Speed
...and 31 more figures
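
The difficulty-adaptive sampling described in the Figure 2 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the 0.3 and 0.7 thresholds, the default difficulty for unseen questions, and the "harder" and "same" branches are hypothetical. The caption itself only states that difficulty is assessed from past rollout rewards and that easier variants are sampled when the original question is hard.

```python
def estimate_difficulty(past_rewards):
    """Hypothetical difficulty score in [0, 1]: one minus the mean past reward.

    Assumes rewards lie in [0, 1] (e.g. binary correctness).
    """
    if not past_rewards:
        return 0.5  # assumption: unseen questions start at medium difficulty
    return 1.0 - sum(past_rewards) / len(past_rewards)


def choose_variant_kind(difficulty, low=0.3, high=0.7):
    """Pick which kind of variant to sample (thresholds are assumptions)."""
    if difficulty >= high:
        return "easier"  # hard question: easier variants keep rewards diverse
    if difficulty <= low:
        return "harder"  # assumption: the symmetric case for easy questions
    return "same"        # assumption: medium questions keep their difficulty


hard_question = choose_variant_kind(estimate_difficulty([0, 0, 0, 1]))
easy_question = choose_variant_kind(estimate_difficulty([1, 1, 1, 1]))
new_question = choose_variant_kind(estimate_difficulty([]))

print(hard_question, easy_question, new_question)
```

The point of the sketch is the direction of the adjustment: a question the policy rarely solves is paired with easier variants, so that the combined group contains both successes and failures.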

Theorems & Definitions (12)

Theorem B.1: Gradient Variance Control
Corollary B.2: Advantage Normalization and Difficulty-Weighted Balancing
Lemma C.1: advantages for binary rewards
proof
Lemma C.2: batch update projection onto $v$
proof
Theorem C.3: optimality of $\mu=\tfrac{1}{2}$
proof
Corollary C.4: Case A: opposite-class gradients
proof
...and 2 more
