GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models
Jixiao Zhang, Chunsheng Zuo
TL;DR
GRPO-LEAD is a difficulty-aware reinforcement learning framework that builds on Group Relative Policy Optimization (GRPO) to improve mathematical reasoning. It introduces a length-dependent accuracy reward, an explicit penalty for incorrect answers, and a difficulty-aware advantage reweighting scheme to address reward sparsity, verbosity, and uneven task difficulty. Across 7B and 14B models, GRPO-LEAD delivers faster convergence, higher accuracy, and more concise outputs, achieving state-of-the-art performance on AIME benchmarks among 14B-scale models. The results underscore the importance of data quality, curriculum design, and targeted reward shaping for robust, scalable mathematical reasoning in large language models.
Abstract
Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem difficulty. We propose GRPO-LEAD, enhancing GRPO with: (1) length-regularized rewards to encourage conciseness while maintaining accuracy; (2) explicit penalties for incorrect solutions to improve model precision; and (3) difficulty-aware advantage reweighting for robust generalization on challenging problems. Comprehensive evaluations demonstrate that GRPO-LEAD significantly improves reasoning accuracy, conciseness, and efficiency. Our approach achieves state-of-the-art performance for 14B-scale models, underscoring the synergy of our methods with appropriate model scale and high-quality data. Our source code, generated dataset, and models are available at https://github.com/aeroplanepaper/GRPO-LEAD.
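To make the three components concrete, the following is a minimal Python sketch of how they might fit together for one group of sampled answers to a single problem. It is not the authors' implementation: the abstract gives no formulas, so the exponential length term, the fixed penalty of -1, the logistic difficulty weight, and every name and default (`shaped_rewards`, `difficulty_weight`, `alpha`, `k`, `midpoint`) are assumptions chosen for illustration. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) reflects standard GRPO.

```python
import math
from statistics import mean, pstdev


def shaped_rewards(correct, lengths, alpha=0.05, wrong_penalty=-1.0):
    """Hypothetical reward: shorter correct answers earn more, wrong answers are penalized."""
    ok_lengths = [n for c, n in zip(correct, lengths) if c]
    mu, sigma = 0.0, 1.0
    if ok_lengths:
        mu = mean(ok_lengths)
        # Fall back to 1.0 when all correct answers have the same length.
        sigma = pstdev(ok_lengths) or 1.0
    rewards = []
    for c, n in zip(correct, lengths):
        if c:
            z = (n - mu) / sigma  # length relative to other correct answers
            rewards.append(math.exp(-alpha * z))
        else:
            rewards.append(wrong_penalty)  # explicit penalty (assumed value)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization of rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def difficulty_weight(pass_rate, k=4.0, midpoint=0.5):
    """Hypothetical logistic weight: larger when fewer samples are correct."""
    return 1.0 + 1.0 / (1.0 + math.exp(k * (pass_rate - midpoint)))


def lead_style_advantages(correct, lengths):
    """Combine the three illustrative pieces for one group of samples."""
    rewards = shaped_rewards(correct, lengths)
    advantages = group_advantages(rewards)
    weight = difficulty_weight(sum(correct) / len(correct))
    return [weight * a for a in advantages]


if __name__ == "__main__":
    # Four sampled answers to one problem: two correct, two incorrect.
    correct = [True, True, False, False]
    lengths = [100, 300, 200, 400]
    print(lead_style_advantages(correct, lengths))
```

In this sketch the shorter correct answer receives the largest advantage, both incorrect answers receive the same negative advantage, and the whole group is scaled up when the pass rate is low, so harder problems contribute more to the policy update.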
