Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity

Philipp Davydov, Ameya Prabhu, Matthias Bethge, Elisa Nguyen, Seong Joon Oh

TL;DR

This paper reframes training-data attribution by asking not which pretraining examples influence an output, but which outputs cannot be traced to any pretraining context. It defines this property as un-attributability, an operational measure of semantic novelty. It implements a scalable two-stage retrieval pipeline: Stage 1 uses lightweight GIST embeddings with a FAISS index to fetch the top-$n$ candidates, and Stage 2 reranks them with ColBERTv2 to assess fine-grained semantic similarity. Novelty is calibrated against a human-written baseline to gauge relative un-attributability. Applied to SmolLM and SmolLM2 on open pretraining corpora, the method shows that models draw on pretraining data across long contextual spans, that novelty varies by task domain, and that instruction tuning can increase novelty beyond stylistic changes, while the similarity measure remains robust to stylistic shifts. The study provides a scalable, auditable framework for analyzing model generalization at pretraining scale and shares ~20 TB of corpus chunks and indices to support replication and extension of the work.

Abstract

Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-$n$ candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the model's output novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
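
The two-stage retrieval described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: exact cosine search in NumPy stands in for the FAISS index, random vectors stand in for GIST and ColBERTv2 embeddings, and the helper names (`retrieve_top_n`, `maxsim_score`, `rerank`) are hypothetical. The reranker uses the late-interaction MaxSim score that ColBERT-style models compute: for each query token, take the maximum cosine similarity to any document token, then sum over query tokens.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.clip(norm, 1e-12, None)

def retrieve_top_n(query_vec, corpus_vecs, n):
    """Stage 1 stand-in: exact cosine search (the paper uses a FAISS index)."""
    sims = l2_normalize(corpus_vecs) @ l2_normalize(query_vec)
    order = np.argsort(-sims)[: min(n, len(sims))]
    return order, sims[order]

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: sum over query tokens of the best match in the document."""
    sim = l2_normalize(query_tokens) @ l2_normalize(doc_tokens).T
    return float(sim.max(axis=1).sum())

def rerank(query_tokens, candidate_ids, corpus_tokens):
    """Stage 2 stand-in: rescore Stage 1 candidates and keep the best one."""
    scores = [maxsim_score(query_tokens, corpus_tokens[i]) for i in candidate_ids]
    best = int(np.argmax(scores))
    return int(candidate_ids[best]), scores[best]

# Toy corpus: 50 chunks, each with 6 token embeddings of dimension 8 (assumed sizes).
rng = np.random.default_rng(0)
corpus_tokens = [rng.normal(size=(6, 8)) for _ in range(50)]
# Hypothetical chunk-level embedding: mean of the token embeddings.
corpus_vecs = np.stack([t.mean(axis=0) for t in corpus_tokens])

# The query is a lightly perturbed copy of chunk 17, i.e. a highly attributable output.
query_tokens = corpus_tokens[17] + 0.01 * rng.normal(size=(6, 8))
query_vec = query_tokens.mean(axis=0)

candidate_ids, _ = retrieve_top_n(query_vec, corpus_vecs, n=10)
best_id, best_score = rerank(query_tokens, candidate_ids, corpus_tokens)
print(best_id, round(best_score, 3))
```

The cheap first stage narrows the corpus to a handful of candidates so that the more expensive token-level comparison only runs on those. A low best score after reranking is what the paper treats as evidence of un-attributability.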

Paper Structure

This paper contains 28 sections, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Embedding similarity is more robust to long or paraphrased texts than N-gram similarity. Comparison of similarity measured with N-gram overlap (left) and embedding cosine similarity (right) with increasing sequence length. The similarity is measured between a model generation and its top 2 closest semantic matches in the pretraining corpus retrieved using our test. Both training excerpts convey the same information as the generation, but lexical overlap fails to recognize this with larger N-grams, whereas embeddings remain robust.
  • Figure 2: Pipeline for scoring the novelty of an LLM output $q$. We test whether $q$ is unattributable to the pretraining corpus, which is our operational definition of novelty. Stage 0 (one-time): Chunk the corpus, compute L2-normalized GIST (Solatorio, 2024) embeddings, and build a cosine-similarity FAISS (Douze et al., 2024) index. Stage 1: Embed $q$ with GIST and retrieve the top-$n$ nearest corpus chunks. Stage 2: Rerank retrieved candidates with ColBERTv2 (Santhanam et al., 2022) at multiple chunk sizes. The novelty score is the median, over $q$'s chunks, of the ColBERTv2 similarity to the best retrieved chunk, normalized by the sequence length and corresponding baseline score.
  • Figure 3: Median ColBERTv2 similarity of SmolLM (top) and SmolLM2 (bottom) generations, reported relative to a human baseline (Dolma). Values: 0 = human baseline, 0.5 = 50% higher than human, $-0.1$ = 10% lower than human. Higher similarity indicates lower novelty.
  • Figure 4: Median ColBERTv2 similarity of SmolLM (top) and SmolLM2 (bottom) generations on domain‑specific benchmarks. Only correct samples are included. For GSM8K and TruthfulQA, the targets serve as the baseline. For OpenRewriteEval (LLM‑generated targets), Dolma is the baseline, matching the open‑ended writing task. Values are relative to the baseline: 0 = human baseline, 0.5 = 50% higher than human, $-0.1$ = 10% lower than human. Higher similarity indicates lower novelty.
  • Figure 5: Number of times each original FAISS-Top-100 index was mapped to the ColBERTv2-reranked top index (index $0$), which was used for the novelty analysis in the experiments section. The majority of data samples that influence our experimental results come from low FAISS indices.
  • ...and 2 more figures
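
The relative values in the Figure 3 and Figure 4 captions (0 = human baseline, 0.5 = 50% higher, $-0.1$ = 10% lower) can be sketched as a ratio of medians minus one. The exact normalization is an assumption here: the Figure 2 caption says only that scores are normalized by sequence length and by the corresponding baseline score, so dividing each score by a token count before taking the median is one plausible reading. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def relative_similarity(model_scores, model_lengths, baseline_scores, baseline_lengths):
    """Median length-normalized similarity of model text, relative to a baseline.

    Returns 0.0 when the model matches the baseline, 0.5 when it is 50% higher,
    and -0.1 when it is 10% lower, following the figure captions.
    """
    model = np.median(np.asarray(model_scores, float) / np.asarray(model_lengths, float))
    base = np.median(np.asarray(baseline_scores, float) / np.asarray(baseline_lengths, float))
    return float(model / base - 1.0)

def is_novel(relative_value):
    """Assumed decision rule: less similar to the corpus than the human baseline."""
    return relative_value < 0.0

# Toy per-chunk reranker scores and chunk lengths in tokens (made-up numbers).
model_rel = relative_similarity([48, 45, 51], [50, 50, 50], [40, 38, 42], [50, 50, 50])
same_rel = relative_similarity([40, 38, 42], [50, 50, 50], [40, 38, 42], [50, 50, 50])
print(round(model_rel, 3), is_novel(model_rel))
```

Because higher similarity indicates lower novelty, a positive relative value means the generation sits closer to the pretraining corpus than human-written text does, and a negative value means it sits further away.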