SHARP: Synthesizing High-quality Aligned Reasoning Problems for Large Reasoning Models Reinforcement Learning

Xiong Jun Wu, Zhenduo Zhang, ZuJie Wen, Zhiqiang Zhang, Wang Ren, Lei Shi, Cai Chen, Deng Zhao, Qing Wang, Xudong Han, Chengfu Tang, Dingnan Jin, Qing Cui, Jun Zhou

TL;DR

SHARP tackles the bottleneck of scarce, high-quality STEM reasoning data for large reasoning models by coupling a principled self-alignment strategy with a three-phase Alignment–Instantiation–Inference framework and an RLVR training loop. It combines a seed-topic three-tier taxonomy, rigorous verification, and a state-of-the-art LRM to generate and validate graduate- or Olympiad-level problems, then refines model reasoning through reinforcement learning. Empirical results on GPQA show SHARP-enhanced distillation and RL Zero training outperform baselines, bringing complex STEM reasoning closer to expert-level proficiency, with strong coverage across physics, chemistry, and biology. The work suggests SHARP as a scalable pathway to substantially elevate LRM capabilities in rigorous STEM domains, with potential applicability beyond the initial benchmarks.

Abstract

Training large reasoning models (LRMs) with reinforcement learning in STEM domains is hindered by the scarcity of high-quality, diverse, and verifiable problem sets. Existing synthesis methods, such as Chain-of-Thought prompting, often generate oversimplified or uncheckable data, limiting model advancement on complex tasks. To address these challenges, we introduce SHARP, a unified approach to Synthesizing High-quality Aligned Reasoning Problems for LRMs reinforcement learning with verifiable rewards (RLVR). SHARP encompasses a strategic set of self-alignment principles -- targeting graduate and Olympiad-level difficulty, rigorous logical consistency, and unambiguous, verifiable answers -- and a structured three-phase framework (Alignment, Instantiation, Inference) that ensures thematic diversity and fine-grained control over problem generation. We implement SHARP by leveraging a state-of-the-art LRM to infer and verify challenging STEM questions, then employ a reinforcement learning loop to refine the model's reasoning through verifiable reward signals. Experiments on benchmarks such as GPQA demonstrate that SHARP-augmented training substantially outperforms existing methods, markedly improving complex reasoning accuracy and pushing LRM performance closer to expert-level proficiency. Our contributions include the SHARP strategy, framework design, end-to-end implementation, and experimental evaluation of its effectiveness in elevating LRM reasoning capabilities.
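
The abstract does not spell out how the verifiable reward is computed. As a rough illustration only, the sketch below assumes multiple-choice problems (as in GPQA) and a hypothetical response format in which the model ends with "Answer: <letter>". The names `extract_choice` and `verifiable_reward`, the regular expression, and the binary 0/1 scoring are all assumptions made for this example, not the authors' implementation.

```python
import re
from typing import Optional

# Hypothetical answer format (an assumption, not from the paper):
# the response states its final choice as "Answer: B" or "Answer: (B)".
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-D])\b\)?", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last multiple-choice letter stated in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def verifiable_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted choice equals the reference, else 0.0."""
    choice = extract_choice(response)
    if choice is None:
        # No parseable answer: the response cannot be verified, so no reward.
        return 0.0
    return 1.0 if choice == reference.strip().upper() else 0.0
```

A reward of this kind can only be computed when each synthesized problem has a single unambiguous reference answer, which is why the self-alignment principles insist on unambiguous, verifiable answers.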

Paper Structure

This paper contains 18 sections, 3 equations, 27 figures, 8 tables, 1 algorithm.

Figures (27)

  • Figure 1: The SHARP Approach
  • Figure 2: The SHARP Framework
  • Figure 3: The SHARP Implementation for Large Reasoning Models Reinforcement Learning
  • Figure 4: GPQA score improvement of single STEM disciplines (physics, chemistry, and biology) of SHARP-Qwen2.5-7B-Instruct-Distill relative to benchmark model DeepSeek-R1-Distill-Qwen-7B in overall ablation of STEM data generated by the SHARP approach and fused with some open-source mathematical data. (The $x$-axis represents the different epochs run during the training of the distill models, and the $y$-axis represents the GPQA score evaluation results corresponding to the checkpoints of the models generated at different epochs).
  • Figure 5: GPQA score improvement of single STEM disciplines (physics, chemistry, and biology) of SHARP-Qwen2.5-7B-Instruct-Distill relative to benchmark model DeepSeek-R1-Distill-Qwen-7B in ablation of STEM data generated by the SHARP approach. (The meanings of the $x$ and $y$ axes are the same as those in Figure 4.)
  • ...and 22 more figures