RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?

Yiyou Sun, Yuhan Cao, Pohao Huang, Haoyue Bai, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song

TL;DR

DELTA introduces a controlled RL benchmark with synthetic programming problem families to study learnability and generalization in LLMs. It demonstrates a grokking phase transition under staged training with dense rewards, showing that RL can uncover new procedural strategies on otherwise unsolvable tasks. The generalization analysis reveals strong transfer along the Exploratory and Compositional axes but persistent challenges in Transformative cases, highlighting both the promise and the limits of RL-driven reasoning. The work emphasizes how training design choices (such as staged warm-up, experience replay, and verification-in-the-loop) shape the discovery of novel algorithmic skills, and it points to avenues for applying these insights to math and science domains.

Abstract

It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To address this debate, we introduce DELTA-Code -- Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding -- a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability -- can LLMs, through reinforcement learning (RL), solve problem families on which pretrained models fail even with a large number of attempts (pass@K=0)? -- and transferability -- if a skill is learned, does it transfer systematically to out-of-distribution (OOD) test sets? Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability (generalization) along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.
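
The pass@K=0 criterion in the abstract can be made concrete with a small sketch. The function below uses the widely used unbiased pass@k estimator for code generation; whether DELTA computes pass@K in exactly this way is an assumption, and the name `pass_at_k` and its parameters are illustrative rather than taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Standard unbiased estimator: 1 - C(n - c, k) / C(n, k).
    Assumption: this is a generic estimator, not necessarily DELTA's exact code.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem family counts as "pass@K=0" when no sampled attempt is correct (c == 0).
unsolved = pass_at_k(n=128, c=0, k=128)
```

With `c == 0` the estimate is exactly zero for every `k`, which is the regime the learnability question targets: a sparse full-pass reward then gives the policy no positive signal to start from.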

Paper Structure

This paper contains 48 sections, 1 theorem, 5 equations, 11 figures, 2 tables.

Key Result

Theorem 1

Setup. Let $P_o$ and $P_i$ be two concentric regular $n$-gons ($n\ge 3$) with circumradii $R_o>R_i>0$. Both polygons rotate rigidly with the same constant angular velocity $\omega$ about their common center. At time $t=0$ a point mass (“ball”) is placed on the inward normal to a side of $P_o$ and mo... Thus $\Delta$ is the (signed) distance between the parallel supporting lines of the corresponding s...

Figures (11)

  • Figure 1: Overview of DELTA with controlled RL studies. Left: Synthetic programming problem families, including Manufactoria with custom syntax and puzzle-like rules, BounceSim with physical simulation, etc. Right: Controlled RL experiments. Top: Learnability shows grokking, where RL shifts from long exploration to sudden convergence, uncovering strategies beyond reference models. Bottom: Generalization extends OMEGA (Sun et al., 2025) across four axes (Exploratory, Compositional, Transformative, and Domain-level), testing adaptation to harder or recombined tasks.
  • Figure 2: The Manufactoria difficulty ladder. 14 problem families are grouped into Basic, Easy, Medium, and Hard levels according to average performance across four popular LLMs. Each test split contains 20–50 problems, and full-pass rates are averaged over 4 independent runs.
  • Figure 3: Full-pass rate (%) on BouncingSim by model, family (ROT_OBJ, ROT_BOX, MOV_BOX, GRAVITY, MULTI_BOX, MULTI_OBJ), and difficulty tier (Basic$\rightarrow$Extreme). Warmer colors denote higher accuracy; cell values are mean full-pass rates per split over 4 runs on 50 test problems each.
  • Figure 4: Pass@k comparison before and after RL training on the Manufactoria-HAS family.
  • Figure 5: Comparison of strategies for solving "pass@K=0" tasks. (a) Directly optimizing the full-pass rate under GRPO fails. (b) Training with a per-test pass rate provides a smoother reward but quickly saturates. (c) Two-phase training: warming up with the per-test pass rate, then switching to the full-pass reward. All training is performed on the Manufactoria-HAS family with the reference model Qwen3-4B-Instruct-2507.
  • ...and 6 more figures
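
The two-phase reward described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `warmup_steps` parameter, and the choice to switch phases at a fixed training step are all assumptions introduced here.

```python
from typing import Sequence


def per_test_reward(results: Sequence[bool]) -> float:
    """Dense reward: fraction of unit tests passed by one sampled program."""
    return sum(results) / len(results) if results else 0.0


def full_pass_reward(results: Sequence[bool]) -> float:
    """Sparse reward: 1.0 only if every unit test passes."""
    return 1.0 if results and all(results) else 0.0


def staged_reward(results: Sequence[bool], step: int, warmup_steps: int) -> float:
    """Hypothetical schedule: dense reward during warm-up, sparse reward afterwards.

    Assumption: the phase switch happens at a fixed step count; the caption does
    not say how the switch point is chosen.
    """
    if step < warmup_steps:
        return per_test_reward(results)
    return full_pass_reward(results)


# A program passing 3 of 4 tests earns partial credit only during warm-up.
partial = [True, False, True, True]
warmup_value = staged_reward(partial, step=10, warmup_steps=100)     # 0.75
post_switch_value = staged_reward(partial, step=100, warmup_steps=100)  # 0.0
```

The dense phase gives a non-zero signal on problems where no sample passes every test, while the later sparse phase removes the incentive to settle for partially correct programs, which matches the saturation noted in panel (b).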

Theorems & Definitions (1)

  • Theorem 1: Periodic bounce between two concentric regular $n$-gons