Difficulty-Estimated Policy Optimization
Yu Zhao, Fan Jiang, Tianle Liu, Bo Zeng, Yu Liu, Longyue Wang, Weihua Luo
TL;DR
DEPO tackles the high rollout cost in reasoning-focused RLVR by introducing an online Difficulty Estimator that filters training data before rollouts. Built on a BERT-based encoder with dual heads, DEPO jointly optimizes advantage estimation, distillation, and ranking losses to predict sample difficulty and align with the actor's capabilities, thereby mitigating zero-variance gradients in GRPO. Empirical results show DEPO achieves around a $1.5\%$ uplift in Avg@32 over GRPO while delivering up to a $2\times$ speedup over DAPO and a substantial reduction in total computational overhead. The approach is plug-and-play, complementary to existing methods, and extends naturally to routing queries across heterogeneous models, offering a scalable path for reasoning scaling in LLMs.
Abstract
Recent advancements in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation when it encounters problems that are either too trivial or overly complex. In these scenarios, every rollout in a group tends to receive the same reward, so the intra-group advantages vanish and the gradient signal becomes susceptible to noise, jeopardizing convergence stability. While variants like DAPO attempt to rectify gradient vanishing, they do not alleviate the substantial computational overhead incurred by exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to improve the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase. This mechanism ensures that computational resources are prioritized for samples with high learning potential. Empirical results demonstrate that DEPO achieves up to a $2\times$ reduction in rollout costs without compromising model performance. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.
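To make the motivation concrete, the following minimal sketch illustrates why groups with identical rewards contribute no learning signal under a group-relative advantage, and how a pre-rollout filter could skip such prompts. It is an illustration under stated assumptions, not the DEPO implementation: the advantage uses the common mean/standard-deviation normalization, and `keep_for_rollout`, its thresholds, and the hard-coded predicted pass rates are hypothetical stand-ins for the output of a learned Difficulty Estimator.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) over one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def keep_for_rollout(predicted_pass_rate, low=0.1, high=0.9):
    """Hypothetical filter: keep prompts predicted to be neither trivial nor hopeless.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return low <= predicted_pass_rate <= high


# Too easy (all correct) or too hard (all wrong): every advantage is zero,
# so the group yields no gradient signal despite the rollout cost.
too_easy = group_advantages([1.0, 1.0, 1.0, 1.0])
too_hard = group_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed outcomes: non-zero advantages, hence a useful training signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Assumed estimator outputs (placeholders for a learned model's predictions).
predicted = {"q_easy": 0.98, "q_mid": 0.55, "q_hard": 0.02}
selected = [q for q, p in predicted.items() if keep_for_rollout(p)]
```

In this toy setting only `q_mid` would be sent to the rollout phase; the two prompts expected to produce uniform rewards are dropped before any generation cost is paid.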
