QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, Jingzhao Zhang

TL;DR

QuestA presents a data-centric augmentation that injects partial-solution hints into hard prompts during RL training to scaffold mathematical reasoning. By prepending the first $p$% of a solution, QuestA creates a learnable curriculum that improves sample efficiency and expands reasoning capacity for 1.5B-scale models, achieving new state-of-the-art results on AIME24, AIME25, and HMMT25. The method is plug-and-play with existing RL pipelines and demonstrates strong generalization, including test-time performance without hints. Theoretical analysis shows hints reshape the learnability landscape, reducing the sampling budget required to discover informative trajectories, and empirical results confirm improved pass@k curves and broader problem coverage. Overall, QuestA offers a practical path to enhance reasoning in LLMs through targeted, scalable data augmentation.

Abstract

Reinforcement learning (RL) has emerged as a central paradigm for training large language models (LLMs) in reasoning tasks. Yet recent studies question RL's ability to incentivize reasoning capacity beyond the base model. This raises a key challenge: how can RL be adapted to solve harder reasoning problems more effectively? To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 72.50% (+10.73%) on AIME24, 62.29% (+12.79%) on AIME25, and 41.67% (+10.11%) on HMMT25. Code, data, and model are available at https://github.com/foreverlasting1202/QuestA.
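
For readers unfamiliar with the pass@k metric mentioned above, the following is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k)/C(n, k), computed from n sampled responses of which c are correct. The function name `pass_at_k` is illustrative, and this page does not state the paper's exact evaluation protocol, so treat this as background rather than the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Returns the probability that at least one of k responses drawn
    without replacement from the n samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k draw contains a correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 2 correct responses out of 4 samples, estimate pass@1 and pass@2.
    print(pass_at_k(4, 2, 1))
    print(pass_at_k(4, 2, 2))
```

With k = 1 the estimator reduces to the fraction of correct samples, c/n, which is why pass@1 is often read as average accuracy.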

Paper Structure

This paper contains 36 sections, 6 theorems, 21 equations, 18 figures, 7 tables.

Key Result

Theorem 4.4

Given a probability threshold $\delta_p > 0$, if for every question $q \in \mathcal{Q}$, the model capacity set $C(q, \delta_p)$ does not intersect with the solution set $\mathcal{S}(q)$, i.e., $C(q, \delta_p) \cap \mathcal{S}(q) = \emptyset$, then under the paper's assumption on the RL algorithm (label assum:RL_algorithm), when training RL for $T$ steps with $B$ samples per step such that $TB = \Theta(1/\delta_p)$, there is a constant probability that the RL algorithm will not update
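
As intuition for the $TB = \Theta(1/\delta_p)$ scaling, here is a simplified sketch, not the paper's proof. Assume (our simplification) that each sampled response to a question is a correct solution with probability at most $\delta_p \le 1/2$, that rewards are binary, and that the policy stays unchanged until a correct response is sampled. Then, with a total budget of $TB = c/\delta_p$ samples for some constant $c > 0$:

$$
\Pr[\text{no correct sample among } TB \text{ draws}] \;\ge\; (1-\delta_p)^{TB} \;=\; (1-\delta_p)^{c/\delta_p} \;\ge\; e^{-2c},
$$

where the last step uses $\ln(1-x) \ge -2x$ for $0 \le x \le 1/2$. The bound $e^{-2c}$ does not depend on $\delta_p$, so under these assumptions a budget proportional to $1/\delta_p$ still leaves a constant probability that no rewarded trajectory is ever observed.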

Figures (18)

  • Figure 1: QuestA is a data augmentation method that injects partial solutions to effectively scaffold RL training on hard reasoning problems. We construct 26K high-quality augmented prompts from challenging instances in OpenR1-Math-220K (openr1_math220k), and fine-tune models using 32K-context-length RL. When applied to Nemotron-1.5B, QuestA delivers substantial performance gains, achieving new state-of-the-art results across all math benchmarks for 1.5B-parameter models.
  • Figure 2: We compare pass@k curves of RLVR-trained models, with and without QuestA. As a controlled experiment, we perform RL training using either easy or hard prompts. Standard RL on easy prompts (red) shows clear degradation in pass@k as $k$ increases compared to the base model (blue). Training on hard prompts (green) improves pass@k, but comes at the cost of substantially longer training. This motivates our development of QuestA, which scaffolds hard problems to improve training efficiency and delivers consistently stronger results: the RL+QuestA model (orange) stays above standard RL (red) across all $k$, while also preserving or improving performance at larger $k$ relative to RL trained with hard prompts.
  • Figure 3: Comparison of RL training dynamics: Training with only hard problems (green) makes progress very slowly due to sparse rewards, while our method with partial solutions (orange) accelerates training and consistently achieves higher accuracy across training steps.
  • Figure 4: QuestA augments each original question in the dataset by prepending the first $p\%$ of the solution sketch. In our experiments, we apply augmentation using the solution block rather than the reasoning chain-of-thought. The hint percentage $p$ is computed as the ratio of tokens used as hints to the total number of tokens in the solution sketch.
  • Figure 5: Training dynamics of QuestA-Nemotron-1.5B. The first and second charts show the progression of average response length and average reward across rollout samples during the RL process, both of which steadily increase over time. The third chart presents the average entropy. Interestingly, the entropy increases over time, suggesting that QuestA does not suffer from entropy collapse and instead encourages diverse and exploratory behavior.
  • ...and 13 more figures
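
The augmentation described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `augment_question`, the whitespace tokenization (a real pipeline would presumably use the model's tokenizer), and the prompt template, including where the hint sits relative to the question, are all hypothetical.

```python
def augment_question(
    question: str,
    solution: str,
    p: float,
    template: str = "{question}\n\nHint (partial solution): {hint}",
) -> str:
    """Attach the first p% of the solution's tokens to the question as a hint.

    Assumptions (not from the paper): tokens are whitespace-separated
    words, and the prompt layout is given by the hypothetical `template`.
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be a percentage between 0 and 100")
    tokens = solution.split()
    # Hint length is the floor of p% of the solution's token count.
    n_hint = int(len(tokens) * p / 100)
    if n_hint == 0:
        # No hint tokens selected: return the original question unchanged.
        return question
    hint = " ".join(tokens[:n_hint])
    return template.format(question=question, hint=hint)


if __name__ == "__main__":
    q = "Find the sum of the first 10 positive integers."
    s = "Use the formula n(n+1)/2 with n = 10 to get 55."
    print(augment_question(q, s, p=50))
```

Setting `p=0` recovers the original question, which matches the evaluation setting described in the TL;DR, where the model is tested without hints.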

Theorems & Definitions (12)

  • Definition 4.1: Solution Set
  • Definition 4.2: Model Capacity Set
  • Theorem 4.4: Lower Bound on RL Learnability under Solution Inaccessibility
  • Definition 4.5: Question Augmentation
  • Theorem 4.6: Informal Upper Bound on RL Learnability with Hint
  • Theorem B.1: Lower Bound on RL Learnability under Solution Inaccessibility
  • proof
  • Lemma B.0: Upper Bound on Sampling Budget for Solution Given Hint
  • proof
  • Theorem B.1: Informal Upper Bound on RL Learnability with Hint
  • ...and 2 more