How Does RL Post-training Induce Skill Composition? A Case Study on Countdown

Simon Park, Simran Kaur, Sanjeev Arora

TL;DR

This work disentangles length generalization from compositional generalization in RL post-training for LLMs by introducing a pattern-based framework that treats Countdown solutions as computation trees. Through controlled dataset generation and canonical pattern mappings, the authors show that RL confers length generalization and partial compositional generalization, with learnability strongly shaped by pattern structure rather than sheer problem depth. They identify a lookahead bottleneck that makes right-heavy, highly sequential patterns particularly hard, and demonstrate that RL enables synthesis of unseen compositional patterns, evidencing genuine compositional generalization beyond pass@k. The findings motivate curriculum designs that strategically expose structurally hard patterns to push the reasoning frontier, and offer diagnostic tools that reveal how and when RL shapes skill composition in reasoning tasks. Overall, the study clarifies the structural factors governing generalization in RL-tuned models and provides a framework for evaluating and advancing compositional reasoning beyond conventional metrics.

Abstract

While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To this end, we study what RL post-training teaches about skill composition and how the structure of the composition affects skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability: models master shallow balanced trees (where the workload is balanced between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when the composition depth is the same as for some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
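
To make the Countdown task concrete, here is a minimal sketch of a solution checker in Python. It is not the authors' evaluation code. The function name `check_countdown` is hypothetical, and the sketch assumes that only `+`, `-`, `*`, `/` are allowed and that every given number must be used exactly once (the abstract does not spell out these rules).

```python
import ast
from collections import Counter
from fractions import Fraction

# Assumed operator set: the four basic arithmetic operations.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, leaves):
    """Evaluate an AST node exactly, recording every integer leaf it uses."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, leaves)
        right = _evaluate(node.right, leaves)
        return _OPS[type(node.op)](left, right)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        leaves.append(node.value)
        return Fraction(node.value)  # exact arithmetic, no float rounding
    raise ValueError("unsupported syntax in expression")


def check_countdown(expression, numbers, target):
    """Return True if `expression` uses each number once and hits `target`.

    Assumption (not from the paper): all numbers must be used exactly once.
    """
    leaves = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, leaves)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(leaves) == Counter(numbers) and value == target
```

For example, `check_countdown("8 + (2 * 9) / 3", [8, 2, 9, 3], 14)` accepts the expression, while an expression that skips one of the numbers or divides by zero is rejected. Exact `Fraction` arithmetic is used so that intermediate divisions do not introduce rounding error.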

Paper Structure

This paper contains 65 sections, 4 equations, 27 figures, 16 tables.

Figures (27)

  • Figure 1: From generated expression to canonical pattern. The model-generated expression $8 + (2 \times 9) \div 3$ is parsed into an expression tree, normalized with respect to algebraic identities, and mapped to the unique canonical pattern $A \times B \div C + D$. Patterns are then grouped by their computational shape. Here, the root operator is "$+$", with a left sub-expression over three leaves $(A,B,C)$ and a right sub-expression over one leaf $(D)$. The signature $[3]+[1]$ serves as a compact representation of the pattern's structure at the root-level.
  • Figure 2: Illustration of length and compositional generalization. While length and compositional generalization are not mutually exclusive, they measure different abilities. A study of length generalization focuses solely on the number of skills, whereas a study of compositional generalization requires an analysis of the compositional structure.
  • Figure 3: Successful length generalization, but a gap between pattern discovery and reliable execution. We train Qwen2.5-1.5B on problem sizes $n \in \{3, 4\}$ and evaluate on held-out data for $n \in \{3,4,5\}$. (Top) The model demonstrates strong length generalization: accuracy on the held-out $n=5$ problems improves over the base model (leftmost data point). (Bottom) The model also identifies almost all compositional patterns for $n=5$, but the ability to reliably execute the correct pattern lags significantly behind its discovery. This indicates that the primary challenge for generalization is procedural reliability, not abstract pattern identification.
  • Figure 4: Compositional structure (not input length) determines difficulty. Within $n=4$, balanced patterns ($[2]\circ[2]$; purple) are discovered (top row) and mastered more reliably (bottom row) than the unbalanced ones (red/blue). Balanced patterns have shallow trees, with solution depth of 2 equal to that of $n=3$ puzzles. Even when controlling for depth, right-heavy patterns ($[1]\circ[3]$; blue) are harder than left-heavy patterns ($[3]\circ[1]$; red). This evidence points to a "lookahead bottleneck," the challenge of committing to a solution ahead of a complex subroutine. The entire structural hierarchy persists for $n=5$. Results shown for a representative run of Qwen2.5-1.5B. See the appendix for equivalent plots for all other runs.
  • Figure 5: Generalizes to entire families of held-out compositional patterns. To evaluate compositional generalization, we removed an entire family of related patterns from the training set: the base $n=3$ structure $A/B+C$ and all of its $n=4$ extensions (e.g., $A/B+C+D$, $A/(B+C)+D$). The model successfully recovers these unseen patterns. Coverage first emerges on the held-out $n=3$ subpattern before generalizing to its more complex $n=4$ dependents, matching the typical learning hierarchy where presence precedes precision. This result demonstrates that the model can reuse learned substructures to assemble novel operator-tree shapes it has never been explicitly trained on.
  • ...and 22 more figures
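
The root-level signatures and tree depths used in the captions of Figures 1 and 4 can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the authors' pipeline: the helper names `count_leaves`, `tree_depth`, and `root_signature` are hypothetical, and the normalization step that maps a raw expression to its canonical pattern is not reproduced here, so the functions should be applied to patterns that are already canonical.

```python
import ast

# Assumed operator set: the four basic arithmetic operations.
_SYMBOLS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def count_leaves(node):
    """Count the leaves (variables or numbers) under an AST node."""
    if isinstance(node, ast.BinOp):
        return count_leaves(node.left) + count_leaves(node.right)
    if isinstance(node, (ast.Name, ast.Constant)):
        return 1
    raise ValueError("unsupported syntax in pattern")


def tree_depth(node):
    """Depth of the expression tree: a leaf has depth 0."""
    if isinstance(node, ast.BinOp):
        return 1 + max(tree_depth(node.left), tree_depth(node.right))
    return 0


def root_signature(pattern):
    """Return the signature '[left leaves]op[right leaves]' at the root."""
    root = ast.parse(pattern, mode="eval").body
    if not isinstance(root, ast.BinOp) or type(root.op) not in _SYMBOLS:
        raise ValueError("pattern must have a binary operator at the root")
    left, right = count_leaves(root.left), count_leaves(root.right)
    return f"[{left}]{_SYMBOLS[type(root.op)]}[{right}]"


def pattern_depth(pattern):
    """Convenience wrapper: depth of a pattern given as a string."""
    return tree_depth(ast.parse(pattern, mode="eval").body)
```

Applied to the canonical pattern from Figure 1, `root_signature("A * B / C + D")` gives `[3]+[1]`. The raw expression `8 + (2 * 9) / 3` would give `[1]+[3]` under this sketch, which is why the normalization step matters before grouping patterns. A balanced $n=4$ pattern such as `(A + B) * (C - D)` has `pattern_depth` 2, the same as the $n=3$ pattern `A / B + C`, whereas `A * B / C + D` has depth 3, consistent with the depth comparison in the Figure 4 caption.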