ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning

Qingnan Ren, Shiting Huang, Zhen Fang, Zehui Chen, Lin Chen, Lijun Li, Feng Zhao

TL;DR

ADORA tackles slow convergence and unstable learning in RL-based reasoning models by dynamically reweighting per-sample advantages during online rollouts. It introduces a unified TAS/TDS framework with Length Advantage and Difficulty Advantage criteria, applying modality-specific weighting: attenuation for weak VLMs and amplification for strong LLMs, all while preserving unbiased policy updates. Across VLM geometry reasoning and LLM math reasoning tasks, ADORA achieves consistent gains with minimal hyperparameter tuning and strong data efficiency, demonstrated on diverse model families and benchmarks. The approach is plug-and-play with GRPO, offering scalable improvements in long-horizon reasoning and generalization for multimodal and text-based reasoning systems.

Abstract

Reinforcement learning has become a cornerstone technique for developing reasoning models in complex tasks, ranging from mathematical problem-solving to imaginary reasoning. The optimization of these models typically relies on policy gradient methods, whose efficacy hinges on the accurate estimation of an advantage function. However, prevailing methods employ static advantage estimation, a practice that leads to inefficient credit assignment by neglecting the dynamic utility of training samples over time. This limitation results in suboptimal policy updates, which manifest as slower convergence and increased learning instability, because models fail to adapt to evolving sample utilities. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework for policy optimization. ADORA dynamically adjusts the advantage function's weighting by adaptively categorizing training data into temporarily advantageous and disadvantageous samples, based on their evolving utility during online model rollouts. This tailored data differentiation strategy allows ADORA to be integrated into existing policy optimization algorithms without significant architectural modifications, enabling the policy to prioritize learning from more informative experiences and thereby achieve more efficient policy updates. Extensive evaluations across diverse model families and varying data scales demonstrate that ADORA is a robust and efficient framework. It significantly enhances long reasoning in both geometric and mathematical tasks, consistently achieving notable performance gains without requiring sensitive hyperparameter tuning.
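
To make the idea of reweighting advantages per rollout group concrete, the following is a minimal sketch and not the authors' implementation. It computes GRPO-style group-normalized advantages and scales them by one weight when a prompt's rollouts are classified as temporarily advantageous and by another otherwise. The classification rule in `is_tas` (accuracy of at least `tau`, and correct rollouts longer on average than incorrect ones) is a hypothetical stand-in for the paper's Length Advantage and Difficulty Advantage criteria, which this page does not define. The names `is_tas`, `reweight_group`, `w_tas`, and `w_tds` are likewise assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) within one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_tas(rewards, lengths, tau=0.5):
    """Hypothetical rule for a 'temporarily advantageous' group (an assumption, not the paper's)."""
    correct = [n for r, n in zip(rewards, lengths) if r > 0]
    wrong = [n for r, n in zip(rewards, lengths) if r <= 0]
    if not correct or not wrong:
        # All rollouts agree, so the group carries no contrast to classify on.
        return False
    accuracy = len(correct) / len(rewards)
    return accuracy >= tau and mean(correct) > mean(wrong)


def reweight_group(rewards, lengths, w_tas=1.0, w_tds=1.0, tau=0.5):
    """Scale every advantage in the group by the weight of the group's category."""
    weight = w_tas if is_tas(rewards, lengths, tau) else w_tds
    return [weight * a for a in grpo_advantages(rewards)]


if __name__ == "__main__":
    rewards = [1, 0, 1, 0]          # binary correctness of four rollouts
    lengths = [120, 40, 100, 60]    # response lengths in tokens
    print(reweight_group(rewards, lengths, w_tas=2.0, w_tds=0.5))
```

Because the weight is applied to the whole group after normalization, the sketch leaves the sign and relative ordering of advantages within a prompt unchanged and only alters how strongly that prompt contributes to the update.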


Paper Structure

This paper contains 41 sections, 8 equations, 14 figures, and 7 tables.

Figures (14)

  • Figure 1: Comparison of vanilla GRPO vs. integration with ADORA for the training of Qwen models.
  • Figure 2: Distribution of Reasoning-Related Keywords for ADORA and vanilla GRPO.
  • Figure 3: Hyperparameter ablation of $\tau$, $\lambda_{\text{att}}$, and $\lambda_{\text{amp}}$.
  • Figure 4: Comparison between DAPO baseline and ADORA.
  • Figure 5: Comparison between training with 2K samples and 10K samples.
  • ...and 9 more figures