GASP: Guided Asymmetric Self-Play For Coding LLMs

Swadesh Jana, Cansu Sancaktar, Tomáš Daniš, Georg Martius, Antonio Orvieto, Pavel Kolev

Abstract

Asymmetric self-play has emerged as a promising paradigm for post-training large language models: a teacher continually generates questions for a student to solve at the edge of the student's learnability. Although these methods promise open-ended data generation bootstrapped from no human data, they suffer from one major problem: not all problems that are hard to solve are interesting or informative for improving the overall capabilities of the model. Current asymmetric self-play methods are goal-agnostic, with no real grounding. We propose Guided Asymmetric Self-Play (GASP), in which grounding is provided by real-data goalpost questions identified as posing a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. With this approach, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play and, through the curriculum constructed by the teacher, solve hard goalpost questions that remain out of reach for all baselines.

Paper Structure (37 sections, 6 equations, 16 figures, 4 tables, 1 algorithm)


Figures (16)

  • Figure 1: Overview of GASP. Self-play is guided by hard real-data coding questions that a standard RLVR run fails to solve. We refer to this subset as our goalpost questions. The teacher first generates an easy variant (lemma) of this goalpost and then a harder variant (lift), producing progressively more challenging questions that push the model's knowledge boundary.
  • Figure 2: Illustration of GASP. Iterative training on generated lemma and lift questions expands the student's knowledge boundary, while the generated questions move closer to the goalpost $h$.
  • Figure 3: Teacher rewards for lemma and lift proposals. We show the learnability curves according to Eq. \ref{eq:lemma_reward} and Eq. \ref{eq:lift_reward}, where the vertical lines mark the corresponding valid bands.
  • Figure 4: Pass@k performance on the LCB eval benchmark (LCBv5). We compare GASP, GASP + Real-data RL, Real-data RL, AZR, and Qwen2.5-Coder-7B. All RL-based results are repeated over three seeds. For each seed of our RL runs, we perform single-checkpoint model selection by choosing the checkpoint that maximizes pass@20 on the LCBv5 evaluation split, and report mean $\pm$ std across seeds. For AZR, we evaluate the authors' publicly released checkpoint (trained from Qwen2.5-Coder-7B); the authors report selecting their checkpoint based on best performance on their evaluation benchmarks (including LCB) over training from a single-seed run. We additionally report pass@1-based checkpoint selection in the appendix (Figure \ref{fig:gasp_rj_best_by_1}), which shows the same qualitative trends.
  • Figure 5: Goalpost questions solved with GASP and GASP + Real-data RL. Visualization over training checkpoints (x-axis: global step) and goalpost question IDs (y-axis). We include only goalpost questions that are solved at least once during training. A cell is marked as solved (green) if pass@100 $> 0$ (i.e., at least one of 100 samples passes); otherwise it is unsolved (red). Results are aggregated across three seeds, where a cell is marked solved if any seed solves the question at that checkpoint (union across seeds).
  • ...and 11 more figures
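
The evaluation protocol described in the captions of Figures 4 and 5 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' evaluation code. It assumes the widely used unbiased pass@k estimator computed from n samples with c correct, which the captions do not confirm. It also assumes the sample standard deviation, and for brevity it aggregates only the pass@20 value at the selected checkpoint. All function names and the data layout (dictionaries keyed by global step or by (question ID, global step)) are hypothetical.

```python
from math import comb
from statistics import mean, stdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate from n samples, c of them correct.

    Assumption: the captions do not state which estimator is used.
    """
    if n - c < k:
        # Every size-k subset of the samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def select_checkpoint(pass20_by_step: dict[int, float]) -> int:
    """Return the global step whose checkpoint maximizes pass@20 (Figure 4)."""
    return max(pass20_by_step, key=pass20_by_step.get)


def mean_std_over_seeds(runs: list[dict[int, float]]) -> tuple[float, float]:
    """Select one checkpoint per seed, then report mean and std across seeds.

    Assumption: sample standard deviation (needs at least two seeds).
    """
    best = [run[select_checkpoint(run)] for run in runs]
    return mean(best), stdev(best)


def solved_union(
    correct_counts: list[dict[tuple[str, int], int]],
) -> dict[tuple[str, int], bool]:
    """Mark a (question_id, step) cell solved if any seed has a passing sample.

    With 100 samples per cell, pass@100 > 0 is equivalent to a correct
    count of at least one, so only the counts are needed (Figure 5).
    """
    cells = set().union(*correct_counts)
    return {
        cell: any(seed.get(cell, 0) > 0 for seed in correct_counts)
        for cell in cells
    }
```

For example, three seeds with per-step pass@20 values can be passed to `mean_std_over_seeds` to get a single mean and standard deviation, and per-seed correct counts can be passed to `solved_union` to build the solved/unsolved grid.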