Difficulty-Estimated Policy Optimization

Yu Zhao; Fan Jiang; Tianle Liu; Bo Zeng; Yu Liu; Longyue Wang; Weihua Luo

TL;DR

DEPO tackles the high rollout cost in reasoning-focused RLVR by introducing an online Difficulty Estimator that filters training data before rollouts. Built on a BERT-based encoder with dual heads, the estimator is trained by jointly optimizing advantage estimation, distillation, and ranking losses, so that it predicts sample difficulty and stays aligned with the actor's capabilities, thereby mitigating zero-variance gradients in GRPO. Empirical results show DEPO achieves around a $1.5\%$ uplift in Avg@32 over GRPO while delivering up to a $2\times$ speedup over DAPO and a substantial reduction in total computational overhead. The approach is plug-and-play, complementary to existing methods, and extends naturally to routing queries across heterogeneous models, offering a scalable path for reasoning scaling in LLMs.

Abstract

Recent advancements in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation when encountering problems that are either too trivial or overly complex. In these scenarios, the disappearance of inter-group advantages makes the gradient signal susceptible to noise, thereby jeopardizing convergence stability. While variants like DAPO attempt to rectify gradient vanishing, they do not alleviate the substantial computational overhead incurred by exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase. This mechanism ensures that computational resources are prioritized for samples with high learning potential. Empirical results demonstrate that DEPO achieves up to a 2x reduction in rollout costs without compromising model performance. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.
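The gradient attenuation described in the abstract can be illustrated with a small numerical sketch. It assumes the commonly used GRPO normalization, in which each rollout's reward is centered and scaled by the statistics of its own group. The function name, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps).

    Assumption: population standard deviation and a small eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Binary verifiable rewards for three hypothetical prompts, 4 rollouts each.
too_easy = [1.0, 1.0, 1.0, 1.0]     # every rollout is correct
too_hard = [0.0, 0.0, 0.0, 0.0]     # every rollout is wrong
informative = [1.0, 0.0, 0.0, 1.0]  # mixed outcomes

for name, rewards in [
    ("too_easy", too_easy),
    ("too_hard", too_hard),
    ("informative", informative),
]:
    print(name, group_relative_advantages(rewards))
```

When all rewards in a group are identical, every advantage is zero, so those rollouts cost compute but contribute no policy-gradient signal. Only the mixed group yields non-zero advantages.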

Paper Structure (32 sections, 9 equations, 8 figures, 3 tables)

Introduction
Preliminary
Proximal Policy Optimization (PPO)
Group Relative Policy Optimization (GRPO)
Existing Methods for Mitigating the Zero-Variance Problem of GRPO
DEPO
Online Difficulty Estimator
Model Architecture
Training Objective
Advantage Estimation Loss
Distillation Loss
Ranking Loss
"Cold-Start" Problem of the Difficulty Estimator
Experiments
Experimental Settings
...and 17 more sections

Figures (8)

Figure 1: Top: the overview of our proposed DEPO framework. Bottom: Training dynamics of downstream accuracy of GRPO and DEPO.
Figure 2: Architectural overview of our proposed DEPO algorithm. DEPO utilizes a Difficulty Estimator to predict advantages $\hat{A}_i$ for sampled questions. Samples with non-zero estimated advantages ($\hat{A}_i \neq 0$) are employed for updating the Actor Model using the standard GRPO algorithm, while those with zero advantage are filtered out to optimize training efficiency. The Difficulty Estimator is simultaneously updated using the computed advantages from the GRPO rollouts as the ground truth.
Figure 3: The architecture of our proposed online difficulty estimator.
Figure 4: The comparison between the predicted rewards from the estimator and the ground-truth target rewards derived from the actor model. The estimator effectively converges, demonstrating a high degree of fidelity in tracking the target reward trajectory throughout the training process.
Figure 5: Training dynamics of filtering ratios when training Qwen2.5-7B-Instruct on datasets of varying difficulty.
...and 3 more figures
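The filter-then-rollout loop described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the paper's method: `ToyDifficultyEstimator` is a hypothetical stand-in that keeps a running pass-rate estimate per question instead of a learned BERT-based encoder, the `margin` threshold replaces the exact zero-advantage test, and `fake_rollout` is a deterministic placeholder for sampling from the actor.

```python
class ToyDifficultyEstimator:
    """Hypothetical stand-in: a running pass-rate estimate per question."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.pass_rate = {}  # question id -> estimated pass rate

    def predict(self, qid):
        # Unseen questions default to 0.5, so they are rolled out at first.
        return self.pass_rate.get(qid, 0.5)

    def update(self, qid, observed):
        # Move the estimate toward the pass rate observed in the rollouts.
        old = self.predict(qid)
        self.pass_rate[qid] = old + self.lr * (observed - old)


def is_informative(p, margin=0.05):
    """A predicted pass rate near 0 or 1 implies near-zero group advantages."""
    return margin < p < 1.0 - margin


def training_step(batch, estimator, rollout_fn, group_size=8):
    """Roll out only questions predicted to be informative."""
    kept, skipped = [], []
    for qid in batch:
        if is_informative(estimator.predict(qid)):
            rewards = rollout_fn(qid, group_size)
            # Rollout outcomes serve as the estimator's training target.
            estimator.update(qid, sum(rewards) / len(rewards))
            kept.append((qid, rewards))  # would feed the GRPO actor update
        else:
            skipped.append(qid)  # no rollout cost is paid for this question
    return kept, skipped


# Deterministic placeholder for actor rollouts (assumed pass rates).
TRUE_PASS = {"easy": 1.0, "medium": 0.5, "hard": 0.0}


def fake_rollout(qid, n):
    solved = round(TRUE_PASS[qid] * n)
    return [1.0 if i < solved else 0.0 for i in range(n)]


estimator = ToyDifficultyEstimator()
rolled_out = 0
for step in range(6):
    kept, skipped = training_step(["easy", "medium", "hard"], estimator, fake_rollout)
    rolled_out += len(kept)
    print(step, "skipped:", skipped)
print("questions rolled out:", rolled_out, "of", 6 * 3)
```

In this toy setup the estimator needs a few steps of rollouts before it begins to skip the trivially easy and unsolvable questions. A per-question table also never revisits a question once it is filtered, whereas a learned estimator that generalizes across questions and tracks the actor would not have that limitation.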
