Chunky Post-Training: Data Driven Failures of Generalization
Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, Sara Price
TL;DR
Chunky post-training is a class of generalization failures in which a model routes on surface features of a prompt because its post-training data arrived in discrete chunks. The authors introduce SURF, a black-box auditor that surfaces rubric-violating behaviors by iteratively reweighting semantic prompt attributes, and TURF, a data-attribution pipeline that traces those behaviors back to features of the training data. They find chunky behaviors across frontier and open models and attribute many of them to identifiable data patterns, including prompt vocabulary, formatting cues, and limited exemplars. The paper also discusses implications for post-training practice, mitigation, and evaluation, along with limitations and directions for future research. Together, SURF and TURF are complementary tools for discovering and diagnosing data-induced generalization failures before deployment and for guiding data curation to improve the reliability of, and trust in, large language models.
Abstract
LLM post-training involves many diverse datasets, each targeting a specific behavior. But these datasets encode incidental patterns alongside the intended ones: correlations between formatting and content, narrow phrasings across diverse problems, and implicit associations arising from the discrete data curation process. These patterns are often invisible to developers yet salient to models, producing behaviors that surprise their creators, such as rejecting true facts when they are presented in a particular question format. We call this chunky post-training: the model learns spurious correlations from distinct chunks of post-training data. We introduce SURF, a black-box pipeline that surfaces these unintended behaviors at run time, and TURF, a tool that traces these failures back to specific post-training data. Applying these tools to frontier models (Claude 4.5, GPT-5.1, Grok 4.1, Gemini 3) and open models (Tülu 3), we show that chunky post-training produces miscalibrated behaviors, which often result from imbalanced or underspecified chunks of post-training data.
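To make the idea of a black-box search over prompt attributes concrete, the following is a minimal sketch, not the SURF implementation. It assumes a fixed list of attribute names, a stand-in model that wrongly rejects a true fact whenever the prompt carries a quiz-style cue, and a simple multiplicative weight update. Every name here (`surf_sketch`, `make_prompt`, `toy_model`, `violates_rubric`, the attribute list, and the `boost` parameter) is hypothetical and chosen only for illustration.

```python
import random


def surf_sketch(attributes, make_prompt, query_model, violates_rubric,
                steps=100, boost=1.5, seed=0):
    """Toy attribute-reweighting loop (an illustrative assumption, not SURF itself)."""
    rng = random.Random(seed)
    weights = {name: 1.0 for name in attributes}
    names = list(weights)
    findings = []
    for _ in range(steps):
        # Draw up to two attributes, favoring those seen in earlier violations.
        drawn = rng.choices(names, weights=[weights[n] for n in names], k=2)
        chosen = sorted(set(drawn))
        prompt = make_prompt(chosen)
        response = query_model(prompt)  # black-box access: text in, text out
        if violates_rubric(prompt, response):
            findings.append((chosen, prompt, response))
            for name in chosen:
                weights[name] *= boost  # upweight attributes present in the failure
    return weights, findings


# Hypothetical attribute names and toy stand-ins for the model and the rubric.
ATTRIBUTES = ["quiz_format", "casual_tone", "code_snippet", "long_context"]


def make_prompt(chosen):
    return "Is water made of hydrogen and oxygen? [" + ", ".join(chosen) + "]"


def toy_model(prompt):
    # Stand-in for a chunky behavior: a surface cue flips the answer.
    return "No, that is false." if "quiz_format" in prompt else "Yes."


def violates_rubric(prompt, response):
    # The toy rubric says a true fact must not be rejected.
    return response.startswith("No")


weights, findings = surf_sketch(ATTRIBUTES, make_prompt, toy_model, violates_rubric)
```

In this toy setting the weight on `quiz_format` grows at least as fast as any other, so later draws concentrate on prompts that trigger the failure. A real auditor would query an actual model and would need a far richer attribute space and rubric judge than these stand-ins.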
