How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study

Yunjie Ji, Sitong Zhao, Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Yiping Peng, Han Zhao, Xiangang Li

TL;DR

This study tackles the challenge of enhancing LLM reasoning by introducing difficulty-aware staged reinforcement learning. It combines three core ideas: (i) selecting training data by measured difficulty, using pass rates from multiple models and rigorous verification, (ii) a staged RL regimen that progressively increases task difficulty, and (iii) cross-domain training that mixes mathematical reasoning and code-generation tasks. Empirical results show meaningful gains on math benchmarks (e.g., AIME-2024 and MATH-500) and cross-domain improvements when training on both math and coding prompts, including notable increases such as 13.4% on AIME-2024 and 5.6% on MATH-500. The work highlights practical strategies for scaling reasoning capabilities in LLMs and underscores ongoing challenges around data quality, difficulty-score reliability, and computational costs, while offering open-source datasets for community use.

Abstract

Enhancing the reasoning capabilities of Large Language Models (LLMs) with efficiency and scalability remains a fundamental challenge in artificial intelligence research. This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning (RL) strategies can substantially improve LLM reasoning performance. Through systematic analysis, we demonstrate that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. Moreover, we introduce a staged training methodology that progressively exposes models to increasingly challenging tasks, further amplifying reasoning capabilities. Our findings reveal significant cross-domain benefits when simultaneously training models on mathematical reasoning and code generation tasks. Notably, our proposed approach enables a 1.5B parameter model to achieve an accuracy of 42.3% on the AIME-2024 benchmark and 89.5% on the MATH-500 benchmark. These results underscore the efficacy of our method in advancing the reasoning proficiency of LLMs. We will open-source our datasets on GitHub and Hugging Face.
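
To make the idea of difficulty-aware selection and staging concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each problem already has a list of verified pass/fail outcomes from sampled model attempts. The function names, the thresholds `easy_min` and `hard_max`, the three bucket names, and the stage composition are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of sampled attempts that were verified correct."""
    if not outcomes:
        raise ValueError("need at least one sampled attempt")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def bucket_by_difficulty(
    attempts: Dict[str, Sequence[bool]],
    easy_min: float = 0.75,  # hypothetical threshold, not from the paper
    hard_max: float = 0.25,  # hypothetical threshold, not from the paper
) -> Dict[str, List[str]]:
    """Group problem ids into easy / medium / hard by measured pass rate."""
    buckets: Dict[str, List[str]] = {"easy": [], "medium": [], "hard": []}
    for problem_id, outcomes in attempts.items():
        rate = pass_rate(outcomes)
        if rate >= easy_min:
            buckets["easy"].append(problem_id)
        elif rate <= hard_max:
            buckets["hard"].append(problem_id)
        else:
            buckets["medium"].append(problem_id)
    return buckets


def staged_curriculum(
    buckets: Dict[str, List[str]],
    stages: Sequence[Tuple[str, ...]] = (("easy", "medium"), ("hard",)),
) -> List[List[str]]:
    """Return one list of problem ids per stage, in order of increasing difficulty.

    The default stage composition is an assumption for illustration only.
    """
    return [[pid for name in stage for pid in buckets[name]] for stage in stages]


if __name__ == "__main__":
    sampled = {
        "p1": [True, True, True, True],
        "p2": [True, False, True, False],
        "p3": [False, False, False, False],
    }
    for index, stage in enumerate(staged_curriculum(bucket_by_difficulty(sampled)), 1):
        print(f"stage {index}: {stage}")
```

In practice the thresholds and the number of stages would be tuned against the measured pass-rate distribution; the sketch only shows the bookkeeping, not the RL training itself.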

Paper Structure

This paper contains 15 sections, 2 equations, and 4 figures.

Figures (4)

  • Figure 1: Performance of the model on the AIME-2024 benchmark during training with three different difficulty levels.
  • Figure 2: Performance of the model during staged RL training. The plot shows the model's score over time, with stage 1 (blue) and stage 2 (orange). The vertical dashed line indicates the point (step 1600) where the model transitioned from stage 1 to stage 2, resulting in performance improvements.
  • Figure 3: Our model, trained using a two-stage reinforcement learning (RL) approach, shows significant performance improvements on two mathematics-related benchmarks, AIME-2024 and MATH-500. However, due to the absence of code-related training data, its performance on LiveCodeBench is essentially the same as that of the base model.
  • Figure 4: Performance comparison of models across different benchmarks: AIME-2024, MATH-500, and LiveCodeBench. Our model is simultaneously trained on math and code data.