Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity
Philipp Davydov, Ameya Prabhu, Matthias Bethge, Elisa Nguyen, Seong Joon Oh
TL;DR
This paper reframes training-data attribution: instead of asking which pretraining examples influence an output, it asks which outputs cannot be traced to any pretraining context, and defines this property as un-attributability, or semantic novelty. It implements a scalable two-stage retrieval pipeline: Stage 1 uses lightweight GIST embeddings with a FAISS index to fetch the top-$n$ candidates, and Stage 2 reranks them with ColBERTv2 to assess fine-grained semantic similarity. Novelty is calibrated against a human-written baseline to gauge relative un-attributability. Applied to SmolLM and SmolLM2 on open pretraining corpora, the method reveals that models draw on pretraining data across long contextual spans, that novelty varies by task domain, and that instruction tuning can increase novelty beyond stylistic changes, while the novelty measure itself remains robust to stylistic shifts. The study provides a scalable, auditable framework for analyzing model generalization at pretraining scale and shares ~20 TB of corpus chunks and indices to support replication and extension of the work.
Abstract
Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the model's output novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
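The two-stage pipeline and the human-baseline comparison described in the abstract can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation: random token embeddings stand in for real encoder outputs, mean pooling with brute-force cosine search stands in for GIST embeddings and a FAISS index, and a plain MaxSim sum stands in for ColBERTv2 reranking. The function names, the per-token normalization of the score, and the strict `<` comparison are all hypothetical choices made for the example.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def stage1_candidates(query_vec, corpus_vecs, n):
    """Stage 1 (stand-in for GIST + FAISS): top-n corpus items by cosine similarity
    between single-vector embeddings, computed by brute force."""
    scores = normalize(corpus_vecs) @ normalize(query_vec)
    return np.argsort(-scores)[:n]

def maxsim(query_tokens, doc_tokens):
    """Stage 2 (ColBERT-style late interaction): for each query token take the
    maximum cosine similarity over document tokens, then sum over query tokens."""
    sim = normalize(query_tokens) @ normalize(doc_tokens).T
    return float(sim.max(axis=1).sum())

def attributability(query_tokens, corpus_tokens, n=2):
    """Score of the best reranked candidate, divided by the number of query tokens
    (the normalization is an assumption of this sketch)."""
    query_vec = query_tokens.mean(axis=0)  # mean pooling as a toy single-vector embedding
    corpus_vecs = np.stack([doc.mean(axis=0) for doc in corpus_tokens])
    candidates = stage1_candidates(query_vec, corpus_vecs, n)
    best = max(maxsim(query_tokens, corpus_tokens[i]) for i in candidates)
    return best / len(query_tokens)

def is_novel(model_score, human_score):
    """An output counts as novel when it is less attributable than the human reference."""
    return model_score < human_score

# Toy data: four "corpus chunks" with 16-dimensional token embeddings.
rng = np.random.default_rng(0)
corpus = [rng.normal(size=(k, 16)) for k in (5, 7, 6, 8)]

copied_output = corpus[2].copy()                                    # verbatim reuse of a chunk
human_reference = corpus[0] + 0.1 * rng.normal(size=corpus[0].shape)  # close paraphrase
fresh_output = rng.normal(size=(6, 16))                             # unrelated to the corpus

score_copied = attributability(copied_output, corpus)
score_human = attributability(human_reference, corpus)
score_fresh = attributability(fresh_output, corpus)

print("copied:", score_copied, "novel?", is_novel(score_copied, score_human))
print("fresh: ", score_fresh, "novel?", is_novel(score_fresh, score_human))
```

In a real setting, the brute-force search in `stage1_candidates` would be replaced by an approximate nearest-neighbor index, which is what makes the first stage tractable at pretraining scale. The more expensive token-level reranking then only has to run on the top-n candidates.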
