GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

Jixiao Zhang, Chunsheng Zuo

TL;DR

GRPO-LEAD presents a difficulty-aware reinforcement learning framework that builds on Group Relative Policy Optimization to improve mathematical reasoning. It introduces a length-dependent accuracy reward, an explicit penalty for incorrect answers, and a difficulty-aware advantage reweighting scheme to address reward sparsity, verbosity, and uneven task difficulty. Across 7B and 14B models, GRPO-LEAD delivers faster convergence, higher accuracy, and more concise outputs, achieving state‑of‑the‑art performance on AIME benchmarks for 14B-scale models. The results underscore the importance of data quality, curriculum design, and targeted reward shaping for robust, scalable mathematical reasoning in large language models.

Abstract

Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem difficulty. We propose GRPO-LEAD, enhancing GRPO with: (1) length-regularized rewards to encourage conciseness while maintaining accuracy; (2) explicit penalties for incorrect solutions to improve model precision; and (3) difficulty-aware advantage reweighting for robust generalization on challenging problems. Comprehensive evaluations demonstrate that GRPO-LEAD significantly improves reasoning accuracy, conciseness, and efficiency. Our approach achieves state-of-the-art performance for 14B-scale models, underscoring the synergy of our methods with appropriate model scale and high-quality data. Our source code, generated dataset, and models are available at https://github.com/aeroplanepaper/GRPO-LEAD.

Paper Structure

This paper contains 30 sections, 11 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The GRPO-LEAD framework assigns length-regularized positive rewards to correct answers and explicit penalties to incorrect ones. A difficulty-based weight $w$ used for advantage reweighting is determined from the empirical correctness of responses for each question. This weight then scales the advantages derived from each question, prioritizing harder questions over easier ones during the policy update to foster robust reasoning.
  • Figure 2: Validation$^*$ Pass@1 over training steps for three configurations: GRPO, GRPO+L, and GRPO+LAD. As the faster convergence shows, the length reward and advantage reweighting provide a richer reward signal than the original setup.
  • Figure 3: Performance versus inference budget for models trained with different ablations of LEAD. GRPO with the length reward (GRPO+L) substantially improves performance at low inference budgets compared to the model before training (DeepseekR1-7B).
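
The pipeline described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. This page does not give the exact formulas, so the exponential length reward, the logistic difficulty weight, every constant (`alpha`, `wrong_penalty`, `w_easy`, `w_hard`, `k`, `midpoint`), and all function names are assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def lead_rewards(correct, lengths, alpha=0.05, wrong_penalty=-1.0):
    """Reward one group of sampled responses to a single question.

    Correct responses get a positive reward that shrinks as they get longer
    than the other correct responses (assumed form: exp(-alpha * z), where z
    is the length z-score among correct responses). Incorrect responses get
    a fixed negative penalty. Both forms are illustrative assumptions.
    """
    correct_lengths = [n for ok, n in zip(correct, lengths) if ok]
    mu = mean(correct_lengths) if correct_lengths else 0.0
    sigma = pstdev(correct_lengths) if len(correct_lengths) > 1 else 0.0
    rewards = []
    for ok, n in zip(correct, lengths):
        if not ok:
            rewards.append(wrong_penalty)
        else:
            z = (n - mu) / sigma if sigma > 0 else 0.0
            rewards.append(math.exp(-alpha * z))
    return rewards


def difficulty_weight(pass_rate, w_easy=0.4, w_hard=1.5, k=10.0, midpoint=0.75):
    """Map empirical correctness to a weight w; harder questions weigh more.

    A decreasing logistic curve is assumed here; the constants are hypothetical.
    """
    return w_easy + (w_hard - w_easy) / (1.0 + math.exp(k * (pass_rate - midpoint)))


def lead_advantages(correct, lengths):
    """Group-normalized advantages, scaled by the question's difficulty weight."""
    rewards = lead_rewards(correct, lengths)
    mu, sigma = mean(rewards), pstdev(rewards)
    w = difficulty_weight(sum(correct) / len(correct))
    return [w * (r - mu) / (sigma + 1e-6) for r in rewards]


if __name__ == "__main__":
    # Three sampled responses: two correct (short and long), one incorrect.
    print(lead_advantages([True, True, False], [100, 200, 150]))
```

In this sketch the shorter correct response receives the largest advantage and the incorrect one the smallest. A question that few samples solve yields a larger `w`, so its advantages contribute more to the policy update.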