rStar2-Agent: Agentic Reasoning Technical Report

Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang

TL;DR

rStar2-Agent presents a 14B math-reasoning model trained with agentic reinforcement learning to achieve frontier performance previously associated with much larger models. The approach combines a scalable, high-throughput Python tool environment, the GRPO-RoC algorithm to mitigate environment noise and reward sparsity, and a compute-efficient, multi-stage training recipe starting from non-reasoning SFT. It delivers 80.6% pass@1 on AIME24 and 69.8% on AIME25 with only 510 RL steps on 64 MI300X GPUs, outperforming larger models on several benchmarks and generalizing to alignment and scientific reasoning tasks. The work demonstrates that agentic RL with tool use can unlock advanced reasoning in smaller, cost-effective models and outlines a path for extending such methods to broader domains.

Abstract

We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multiple RL stages, yielding advanced cognitive abilities with minimal compute cost. As a result, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
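
The "average pass@1" figures quoted above can be read as the fraction of sampled responses that are correct, averaged over the problems in a benchmark. The sketch below illustrates that arithmetic on toy data. The function name, the input layout, and the number of samples per problem are assumptions made for illustration and are not taken from the paper or its released code.

```python
from typing import Dict, List


def average_pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean over problems of the fraction of sampled responses that are correct.

    `results` maps a problem id to one boolean per sampled response
    (True if that response's final answer was judged correct).
    This layout is a hypothetical convention chosen for the example.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    per_problem = []
    for problem_id, outcomes in results.items():
        if not outcomes:
            raise ValueError(f"no samples recorded for problem {problem_id!r}")
        # sum() counts True values, so this is the per-problem pass@1 estimate.
        per_problem.append(sum(outcomes) / len(outcomes))
    return sum(per_problem) / len(per_problem)


# Toy data (hypothetical): two problems, four sampled responses each.
toy = {
    "p1": [True, True, False, True],    # 3 of 4 correct -> 0.75
    "p2": [False, True, False, False],  # 1 of 4 correct -> 0.25
}
print(f"{average_pass_at_1(toy):.3f}")  # prints 0.500
```

Averaging per problem first keeps every problem equally weighted even if the number of samples differs between problems.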

Paper Structure

This paper contains 21 sections, 4 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 2: rStar2-Agent trains LLMs to natively use Python coding tools within the dedicated execution environment, enabling more advanced and effective reasoning for complex problem-solving.
  • Figure 3: Our prompt template. Question will be replaced with the specific question during training.
  • Figure 4: Proportion of tool calls that contain errors within correctly answered trajectories. Under naive GRPO, the error rate initially decreases but soon plateaus at a significant level. In contrast, our GRPO-RoC continues to reduce tool-related errors with more training steps.
  • Figure 5: The overall design of our agentic reinforcement learning infrastructure.
  • Figure 6: Our code environment demonstrates scalability by reliably handling up to 45K concurrent tool calls per step, while maintaining consistently low end-to-end latency from dispatch to response.
  • ...and 5 more figures
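
The quantity described in the Figure 4 caption, the proportion of tool calls that contain errors within correctly answered trajectories, can be sketched as a simple ratio. The `Trajectory` record and its field names below are hypothetical, and the toy values are invented for illustration. How the authors actually log tool calls and detect errors is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """Hypothetical rollout record; field names are assumptions."""
    is_correct: bool              # final answer matched the reference
    tool_call_errors: List[bool]  # one flag per tool call, True if it errored


def tool_error_rate_in_correct(trajectories: List[Trajectory]) -> float:
    """Fraction of tool calls with errors, counted only over correct trajectories."""
    total_calls = 0
    errored_calls = 0
    for traj in trajectories:
        if not traj.is_correct:
            continue  # the metric ignores incorrectly answered trajectories
        total_calls += len(traj.tool_call_errors)
        errored_calls += sum(traj.tool_call_errors)
    # Return 0.0 when no tool call was made in any correct trajectory.
    return errored_calls / total_calls if total_calls else 0.0


# Toy rollouts (hypothetical): two correct trajectories and one incorrect one.
toy_rollouts = [
    Trajectory(is_correct=True, tool_call_errors=[False, True]),
    Trajectory(is_correct=True, tool_call_errors=[False, False]),
    Trajectory(is_correct=False, tool_call_errors=[True, True]),
]
print(tool_error_rate_in_correct(toy_rollouts))  # prints 0.25
```

Tracking this ratio across training steps would give a curve of the kind the caption describes, where a plateau means that correct answers are still being reached through error-prone tool use.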