How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study

Yunjie Ji; Sitong Zhao; Xiaoyu Tian; Haotian Wang; Shuaiting Chen; Yiping Peng; Han Zhao; Xiangang Li

TL;DR

This study tackles the challenge of enhancing LLM reasoning by introducing difficulty-aware staged reinforcement learning. It combines three core ideas: (i) selecting training data by measured difficulty, using pass rates from multiple models and rigorous verification; (ii) a staged RL regimen that progressively increases task difficulty; and (iii) cross-domain training that mixes mathematical reasoning and code-generation tasks. Empirical results show meaningful gains on math benchmarks (e.g., AIME-2024 and MATH-500) and cross-domain improvements when training on both math and coding prompts, including notable increases such as 13.4% on AIME-2024 and 5.6% on MATH-500. The work highlights practical strategies for scaling reasoning capabilities in LLMs and underscores ongoing challenges around data quality, difficulty-score reliability, and computational cost, while offering open-source datasets for community use.

Abstract

Enhancing the reasoning capabilities of Large Language Models (LLMs) efficiently and scalably remains a fundamental challenge in artificial intelligence research. This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning (RL) strategies can substantially improve LLM reasoning performance. Through systematic analysis, we demonstrate that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. Moreover, we introduce a staged training methodology that progressively exposes models to increasingly challenging tasks, further amplifying reasoning capabilities. Our findings also reveal significant cross-domain benefits when models are trained simultaneously on mathematical reasoning and code generation tasks. Notably, our proposed approach enables a 1.5B-parameter model to achieve an accuracy of 42.3% on the AIME-2024 benchmark and 89.5% on the MATH-500 benchmark. These results underscore the efficacy of our method in advancing the reasoning proficiency of LLMs. We will open-source our datasets on GitHub and Hugging Face.
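
To make the idea of difficulty-aware staging concrete, the following is a minimal sketch, not the authors' implementation. It assumes each prompt already has a verified pass count from sampled model responses, and it groups prompts into stages ordered from easiest to hardest. The `Prompt` class, the `build_stages` function, and the threshold values `0.7` and `0.3` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    """A training prompt with hypothetical pass statistics."""
    text: str
    passes: int   # sampled responses verified as correct
    samples: int  # total responses sampled for this prompt

    @property
    def pass_rate(self) -> float:
        return self.passes / self.samples


def build_stages(prompts, thresholds=(0.7, 0.3)):
    """Group prompts into stages, easiest first.

    `thresholds` must be in descending order (illustrative values).
    Stage i holds prompts whose pass rate is at least thresholds[i]
    but below every earlier threshold; the final stage holds the rest.
    """
    stages = [[] for _ in range(len(thresholds) + 1)]
    for prompt in prompts:
        for i, threshold in enumerate(thresholds):
            if prompt.pass_rate >= threshold:
                stages[i].append(prompt)
                break
        else:
            # Below every threshold: hardest stage.
            stages[-1].append(prompt)
    return stages


if __name__ == "__main__":
    pool = [
        Prompt("easy problem", 9, 10),
        Prompt("medium problem", 5, 10),
        Prompt("hard problem", 1, 10),
    ]
    for stage_index, stage in enumerate(build_stages(pool)):
        print(stage_index, [p.text for p in stage])
```

In a staged regimen of this kind, an RL run would train on the first stage, then continue from that checkpoint on each later stage; the number of stages and where the boundaries fall are design choices this sketch does not settle.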
