DARO: Difficulty-Aware Reweighting Policy Optimization

Jingyu Zhou, Lu Ma, Hao Liang, Chengyu Shen, Bin Cui, Wentao Zhang

TL;DR

This work introduces Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state, achieving significantly faster convergence and superior final performance.

Abstract

Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.
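To make the idea of reweighting the loss by difficulty group concrete, the following is a minimal sketch and not the authors' implementation. It assumes that difficulty is measured by how many of a prompt's sampled responses the verifier marks correct, that each prompt already has a scalar base loss, and that the group weights are simply passed in. The names `correct_count`, `reweighted_loss`, and `group_weights` are hypothetical; how the weights are actually updated during training is not described in this summary and is not shown here.

```python
from collections import defaultdict


def correct_count(rewards):
    """Number of sampled responses the verifier marked correct (reward 1)."""
    return sum(rewards)


def reweighted_loss(prompt_rewards, prompt_losses, group_weights):
    """Weighted sum of per-difficulty-group mean losses (illustrative only).

    prompt_rewards: one list of binary verifier rewards per prompt
    prompt_losses:  one scalar base loss per prompt
    group_weights:  hypothetical dict mapping a correct-response count to a
                    weight; groups without an entry default to weight 1.0
    """
    # Bucket prompt losses by difficulty, keyed on the number of correct responses.
    groups = defaultdict(list)
    for rewards, loss in zip(prompt_rewards, prompt_losses):
        groups[correct_count(rewards)].append(loss)

    # Average the loss inside each bucket, then scale it by that bucket's weight.
    total = 0.0
    for count, losses in groups.items():
        group_mean = sum(losses) / len(losses)
        total += group_weights.get(count, 1.0) * group_mean
    return total


# Three prompts with four sampled responses each.
rewards = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
losses = [2.0, 1.0, 4.0]
weights = {1: 0.5, 2: 2.0}
print(reweighted_loss(rewards, losses, weights))
```

With static weights this reduces to a fixed weighting scheme of the kind the abstract criticizes; a dynamic scheme would change `group_weights` as training progresses.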

Paper Structure

This paper contains 20 sections, 26 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The Difficulty-Aware Reweighting Policy Optimization (DARO) framework. This figure illustrates the training pipeline where, for each prompt, an empirical pass rate ($\mu_i$) and a base loss ($\mathcal{L}_i$) are computed using verifier rewards. Unlike methods that use static weighting functions based on $\mu$ (e.g., GRPO, DAPO, Dr. GRPO, and LIPO), DARO treats the weights ($w_\mu$) as optimizable parameters. This allows the weights to adapt dynamically to the model's state, as visualized by the evolving weight distributions across training steps (top).
  • Figure 2: Loss scale issue observed in the GRPO training process of three base models, exponentially smoothed with $\alpha=0.1$. The solid line represents the true loss value, while the dashed line represents the approximate loss value given by Equation \ref{eq:theorem_1}. Detailed hyperparameters are given in Section \ref{sec:settings}.
  • Figure 3: Normalized response lengths of the Qwen2.5-Math-7B model during GRPO training. Response lengths are normalized by dividing by the sum of response lengths across the batch $B$ at each step. The solid and dashed lines represent the response lengths of positive ($r=1$) and negative ($r=0$) samples, respectively.
  • Figure 4: Entropy, response length, and training reward dynamics of the Qwen2.5-Math-7B model throughout training.
  • Figure 5: Average pass rates for all settings. DARO converges faster than the other methods and maintains the highest pass rate throughout training.
  • ...and 1 more figure
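
The captions for Figures 2 and 3 mention two post-processing steps: exponential smoothing with $\alpha=0.1$ and normalizing response lengths by the batch total at each step. A rough sketch of both follows, under stated assumptions: $\alpha$ is taken to be the weight on the newest value (the captions do not say which convention is used), and each length is divided by the sum of lengths in its batch. The function names `exp_smooth` and `normalize_lengths` are hypothetical.

```python
def exp_smooth(values, alpha=0.1):
    """Exponential moving average of a sequence.

    Assumed convention: alpha is the weight on the newest value, and the
    first smoothed point equals the first raw point.
    """
    smoothed = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def normalize_lengths(lengths):
    """Divide each response length by the total length in the batch."""
    total = sum(lengths)
    return [n / total for n in lengths]


# Smooth a short, noisy loss curve and normalize one batch of response lengths.
print(exp_smooth([1.0, 0.0, 1.0, 0.0], alpha=0.1))
print(normalize_lengths([100, 300, 600]))
```

Under the opposite convention, where $\alpha$ weights the previous smoothed value, the roles of `alpha` and `1 - alpha` would swap and the curve would be far less smooth at $\alpha=0.1$.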