Chunky Post-Training: Data Driven Failures of Generalization

Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, Sara Price

TL;DR

The paper identifies chunky post-training: a class of generalization failures in which a model routes its behavior on surface features of the prompt, as a result of patterns in discrete chunks of post-training data. The authors introduce SURF, a black-box auditor that surfaces rubric-violating behaviors by iteratively reweighting semantic prompt attributes, and TURF, a data-attribution pipeline that traces those behaviors back to features of the training data. They demonstrate widespread chunky behaviors across frontier and open models and show that many such failures are attributable to identifiable data patterns, including prompt vocabularies, formatting cues, and limited exemplars. The paper also discusses implications for post-training practice, mitigation, and evaluation, along with limitations and directions for future research. Together, SURF and TURF are complementary tools: one discovers data-induced generalization failures before deployment, and the other diagnoses them to guide data curation.

Abstract

LLM post-training involves many diverse datasets, each targeting a specific behavior. But these datasets encode incidental patterns alongside intended ones: correlations between formatting and content, narrow phrasings across diverse problems, and implicit associations arising from the discrete data curation process. These patterns are often invisible to developers yet salient to models, producing behaviors that surprise their creators, such as rejecting true facts presented in a particular question format. We call this chunky post-training: the model learns spurious correlations as a result of distinct chunks of post-training data. We introduce SURF, a black-box pipeline which surfaces these unintended behaviors at run time, and TURF, a tool that traces these failures back to specific post-training data. Applying these tools to frontier models (Claude 4.5, GPT-5.1, Grok 4.1, Gemini 3) and open models (Tülu 3), we show that chunky post-training produces miscalibrated behaviors, which often result from imbalanced or underspecified chunks of post-training data.

Paper Structure (54 sections, 6 equations, 29 figures, 2 tables, 1 algorithm)

This paper contains 54 sections, 6 equations, 29 figures, 2 tables, 1 algorithm.

Figures (29)

  • Figure 1: In (a) we identify a failure of a model to generalize its training signal correctly. The model applies "rebut" behavior to a query based on some feature of the prompt (but not its knowledge of whether the statement is true). In (b) we show an overview of our tooling to find and attribute generalization routing issues. In Section \ref{section:finding-failures} we introduce SURF (Surfacing Unintended Response Failures), a pipeline for finding failures of generalization. In Section \ref{section:data-att} we use TURF (Tracing Unintended Responses via Features) to match generalization failures with the post-training data which induced them.
  • Figure 2: An array of frontier model behaviors found using SURF. We see Gemini staying task-focused in response to the user's highly personal code comments. GPT generates code when given some conditionals. Sonnet 4.5 refuses a benign query because it involves financial terms like "invoice" and "voucher".
  • Figure 3: The main components of SURF. The input to the loop is a rubric specifying the behavior to search for. The algorithm works by iteratively reweighting its attribute pool based on the prompts which scored highest against the rubric. The reweighted pool is then sampled to generate the next round of queries.
  • Figure 4: GPT refuses a genuine request involving the Holocaust. Opus takes a strange interpretation of a user's worrying request, ignoring the emotional content.
  • Figure 5: The top 45 prompts from each pipeline were perturbed 20 times, and each perturbation was sampled 100 times. We plot the average rate of incorrect behavioral routing of prompts upon resampling. Across these nine experiments, prompts were relatively insensitive to small changes in phrasing.
  • ...and 24 more figures
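
The reweighting loop described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the multiplicative update rule, the function names (`surf_sketch`, `sample_attributes`), the parameters, and the toy generator and scorer are all assumptions made for illustration. In a real audit, the generator would presumably be an LLM writing a query from the sampled attributes, and the scorer would query the target model and grade its response against the rubric.

```python
import random


def sample_attributes(weights, k, rng):
    """Draw k distinct attributes, each with probability proportional to its weight."""
    pool = dict(weights)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return tuple(chosen)


def surf_sketch(attributes, generate_prompt, score_prompt, rounds=5,
                prompts_per_round=20, attrs_per_prompt=2, top_k=5,
                learning_rate=1.0, seed=0):
    """Hypothetical reweighting loop: upweight attributes of high-scoring prompts."""
    rng = random.Random(seed)
    weights = {a: 1.0 for a in attributes}
    found = []  # (score, prompt) pairs that scored above zero
    for _ in range(rounds):
        scored = []
        for _ in range(prompts_per_round):
            attrs = sample_attributes(weights, attrs_per_prompt, rng)
            prompt = generate_prompt(attrs)
            scored.append((score_prompt(prompt), attrs, prompt))
        # Keep the prompts that scored highest against the rubric.
        scored.sort(key=lambda item: item[0], reverse=True)
        for score, attrs, prompt in scored[:top_k]:
            if score > 0:
                found.append((score, prompt))
            # Assumed update rule: a zero score leaves the weights unchanged.
            for a in attrs:
                weights[a] *= 1.0 + learning_rate * score
    return weights, found


# Toy stand-ins (assumptions): no model is queried here.
ATTRIBUTES = [
    "mentions invoices and vouchers",
    "written as code comments",
    "asks a yes/no question",
    "uses formal tone",
    "includes a numbered list",
    "refers to a historical event",
]


def toy_generate(attrs):
    return "Write a user query that " + " and ".join(attrs) + "."


def toy_score(prompt):
    # Pretend the rubric violation only fires on financial vocabulary.
    return 1.0 if "invoices" in prompt else 0.0


if __name__ == "__main__":
    final_weights, hits = surf_sketch(ATTRIBUTES, toy_generate, toy_score)
    for name, weight in sorted(final_weights.items(), key=lambda kv: -kv[1]):
        print(f"{weight:12.1f}  {name}")
```

In this toy setup, the attribute that triggers the scorer accumulates weight across rounds, so later rounds sample it more often. Scaling the update by the score, rather than boosting every top-ranked prompt equally, is one design choice among several: it keeps the pool unchanged in rounds where nothing violates the rubric.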