VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL

Zengjie Hu, Jiantao Qiu, Tianyi Bai, Haojin Yang, Binhang Yuan, Qi Jing, Conghui He, Wentao Zhang

TL;DR

VADE tackles gradient vanishing in group-based multimodal RL by introducing online sample-level difficulty estimation with Beta posteriors, a Thompson sampler guided by an information-gain objective, and a two-scale prior decay to track policy evolution. The data selection problem is modeled as a non-stationary Multi-Armed Bandit, enabling proactive, online sampling without extra rollouts and serving as a plug-in to GRPO/GSPO. Empirical results across MathVista, MathVerse, MathVision, ScienceQA, and ChartQA show VADE delivers superior sample efficiency and final performance while reducing rollout costs, with ablations validating each component's contribution. The approach offers a practical, scalable enhancement for multimodal reasoning RL tasks, broadening the applicability of efficient group-based training.

Abstract

Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical gradient vanishing problem: when all responses within a group receive identical rewards, the advantage estimates collapse and the training signal diminishes. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose VADE, a Variance-Aware Dynamic sampling framework via online sample-level difficulty Estimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three-component design enables VADE to dynamically select the most informative samples, thereby amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. More importantly, our framework can serve as a plug-and-play component that integrates seamlessly into existing group-based RL algorithms. Code and models are available at https://VADE-RL.github.io.
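
The gradient vanishing problem described above can be illustrated with a minimal sketch of group-relative advantage estimation. It assumes the common GRPO-style standardization (reward minus group mean, divided by group standard deviation); the function names and the `eps` stabilizer are illustrative choices, not taken from the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style).

    eps is an assumed numerical stabilizer for the zero-variance case.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_effective(rewards):
    """A group yields a training signal only if its rewards are not all identical."""
    return len(set(rewards)) > 1


mixed = [1.0, 0.0, 0.0, 1.0]        # some responses correct, some wrong
all_correct = [1.0, 1.0, 1.0, 1.0]  # sample is too easy for the current policy
all_wrong = [0.0, 0.0, 0.0, 0.0]    # sample is too hard for the current policy

for name, group in [("mixed", mixed), ("all_correct", all_correct), ("all_wrong", all_wrong)]:
    print(name, is_effective(group), group_relative_advantages(group))
```

In the two uniform groups every advantage is zero, so those rollouts contribute nothing to the policy gradient even though they were paid for at generation time.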

Paper Structure

This paper contains 17 sections, 4 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Proportion of data yielding effective training signals throughout training. Left: Under standard GRPO with random sampling, the proportion of informative samples providing non-zero gradients diminishes over time. Right: Our proposed Thompson sampling strategy consistently maintains a high ratio of effective data by proactively selecting informative samples.
  • Figure 2: Overview of the VADE framework. Our method maintains distributions $\text{Beta}(\alpha_i + 1, \beta_i+ 1)$ for each sample to enable online difficulty estimation. Through Thompson sampling and InfoGain $\mathcal{I}_i = p_t(1-p_t)^2$ maximization, VADE dynamically selects informative batches for group-wise rollouts. The two-scale prior decay mechanism ensures estimates remain accurate throughout policy evolution.
  • Figure 3: Training dynamics of Qwen2.5VL-7B-Instruct trained with the GRPO algorithm. (a) Validation score, illustrating convergence speed and final performance; (b) Effective gradient ratio, representing the proportion of data with non-uniform rewards (neither all-zero nor all-one) in each training batch, reflecting data efficiency; (c) Actor gradient norm, indicating the magnitude of gradient signals throughout training.
  • Figure 4: Training dynamics comparison between our method and DAPO. (a) Validation score. The x-axis represents the total count of forward passes through the model to sample responses for all training batches. Our method achieves competitive or superior validation performance with significantly fewer rollout generations, demonstrating substantially higher training efficiency compared to DAPO. (b) Cumulative rollout generations throughout training. This panel plots the total number of forward passes performed to sample responses for all training batches up to a given step. DAPO incurs significantly more rollout generations than VADE due to its over-sampling and filtering strategy, demonstrating the superior computational efficiency of our approach.
  • Figure 5: Training dynamics of Qwen2.5VL-7B-Instruct trained with the GSPO algorithm. (a) Validation score, illustrating convergence speed and final performance; (b) Effective gradient ratio, representing the proportion of data with non-uniform rewards (neither all-zero nor all-one) in each training batch, reflecting data efficiency; (c) Actor gradient norm, indicating the magnitude of gradient signals throughout training.
  • ...and 3 more figures
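
The selection loop outlined in the Figure 2 caption can be sketched roughly as follows. Only the $\text{Beta}(\alpha_i + 1, \beta_i + 1)$ posterior and the score $p(1-p)^2$ come from the caption; the top-k batch rule, the `SampleStats` class, and the single `decay` factor are hypothetical stand-ins, since this excerpt does not specify how batches are formed or how the two-scale prior decay is defined.

```python
import random


class SampleStats:
    """Per-sample running counts of correct (alpha) and incorrect (beta) rollouts."""

    def __init__(self):
        self.alpha = 0.0
        self.beta = 0.0


def info_gain(p):
    # Score as written in the Figure 2 caption: p * (1 - p)^2
    return p * (1.0 - p) ** 2


def select_batch(stats, batch_size, rng):
    """Thompson sampling: draw p_i ~ Beta(alpha_i + 1, beta_i + 1), score it,
    and keep the highest-scoring samples (top-k is an assumed selection rule)."""
    scored = []
    for idx, s in enumerate(stats):
        p = rng.betavariate(s.alpha + 1.0, s.beta + 1.0)
        scored.append((info_gain(p), idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:batch_size]]


def update(s, num_correct, group_size, decay=0.9):
    """Fold one group's rollout outcomes into the counts.

    The single decay factor is a hypothetical simplification that discounts
    stale evidence; it is not the paper's two-scale mechanism.
    """
    s.alpha = decay * s.alpha + num_correct
    s.beta = decay * s.beta + (group_size - num_correct)


# Usage: one selection round over a toy dataset of 6 samples.
rng = random.Random(0)
dataset_stats = [SampleStats() for _ in range(6)]
chosen = select_batch(dataset_stats, batch_size=2, rng=rng)
for idx in chosen:
    update(dataset_stats[idx], num_correct=3, group_size=8)
```

Because the counts are updated from rollouts that training performs anyway, this style of estimation needs no extra generations. Under the caption's score, the most favored samples are those whose sampled correctness probability is near 1/3, where $p(1-p)^2$ reaches its maximum of 4/27.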