How Does RL Post-training Induce Skill Composition? A Case Study on Countdown
Simon Park, Simran Kaur, Sanjeev Arora
TL;DR
This work disentangles length generalization from compositional generalization in RL post-training of LLMs, using a pattern-based framework that treats Countdown solutions as computation trees. With controlled dataset generation and canonical pattern mappings, the authors show that RL confers length generalization and partial compositional generalization, and that learnability depends strongly on pattern structure rather than on problem depth alone. They identify a lookahead bottleneck that makes right-heavy, highly sequential patterns particularly hard, and show that RL-trained models synthesize unseen compositional patterns, which is evidence of compositional generalization that pass@k alone does not capture. The findings motivate curricula that deliberately expose structurally hard patterns to push the reasoning frontier, and the framework offers diagnostic tools for evaluating how and when RL shapes skill composition in reasoning tasks.
Abstract
While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To separate the two, we study what RL post-training teaches about skill composition and how the structure of a composition affects skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and can thus be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; and (ii) a structure-dependent hierarchy of learnability: models master shallow balanced trees (where the workload is split evenly between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when their composition depth matches that of some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
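To make the task and the tree view concrete, the following is a minimal Python sketch, not the authors' code. It assumes that every number must be used exactly once, that the four arithmetic operators are allowed, and that fractional intermediate values are permitted; the paper's exact rules may differ. The helper names (`solve`, `shape`, `depth`, `evaluate`, `leaves`) and the tuple encoding of trees are hypothetical, and "left-heavy" and "right-heavy" here refer only to which child of the root carries the deeper subtree.

```python
from fractions import Fraction
import operator

# Assumed operator set; the paper's Countdown variant may restrict it.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def solve(numbers, target):
    """Brute-force search. Returns one expression tree or None.

    A tree is either an int leaf or a tuple (op, left, right).
    """
    items = [(Fraction(n), n) for n in numbers]  # (value, tree) pairs
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, tree = items[0]
        return tree if value == target else None
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j:
                continue
            (a, tree_a), (b, tree_b) = items[i], items[j]
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            for sym, fn in OPS.items():
                if sym == "/" and b == 0:
                    continue  # skip division by zero
                merged = (fn(a, b), (sym, tree_a, tree_b))
                found = _search(rest + [merged], target)
                if found is not None:
                    return found
    return None


def evaluate(tree):
    """Evaluate a tree exactly using rational arithmetic."""
    if not isinstance(tree, tuple):
        return Fraction(tree)
    sym, left, right = tree
    return OPS[sym](evaluate(left), evaluate(right))


def leaves(tree):
    """List the numbers used by a tree, left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[1]) + leaves(tree[2])


def shape(tree):
    """Drop numbers and operators, keeping only the tree skeleton."""
    if not isinstance(tree, tuple):
        return "x"
    return (shape(tree[1]), shape(tree[2]))


def depth(tree):
    """Number of operator levels on the longest root-to-leaf path."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


# Three skeletons over four leaves: one balanced, two fully sequential.
balanced = ("+", ("*", 1, 2), ("-", 3, 4))
left_heavy = ("+", ("+", ("+", 1, 2), 3), 4)
right_heavy = ("+", 1, ("+", 2, ("+", 3, 4)))
```

In this encoding, `left_heavy` and `right_heavy` both have depth 3 yet different shapes, which is the kind of distinction a depth-only analysis would miss. Grouping solutions by `shape` rather than by `depth` or by n is one simple way to track per-structure success rates of the kind the abstract describes.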
