Scale Dependent Data Duplication

Joshua Kazdan, Noam Levi, Rylan Schaeffer, Jessica Chudnovsky, Abhay Puri, Bo He, Mehmet Donmez, Sanmi Koyejo, David Donoho

TL;DR

This work derives explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus. It also identifies and resolves an unstudied source of scale-dependence, allowing more accurate prediction at scale.

Abstract

Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g., translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy loss gradients for semantically equivalent documents become more aligned. Smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, we embed all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m. For moderate corpus sizes, the cosine similarity between nearest neighbors follows an isotropic power-law baseline. However, as corpus size grows to hundreds of billions of tokens, the nearest-neighbor similarities deviate sharply from this baseline, indicating accelerated semantic collisions. Finally, controlled pretraining on data sampled with replacement from pools of finite unique documents shows that limited uniqueness yields mild degradation for small models but rapidly increasing loss penalties for larger models, breaking naive scaling extrapolation. We derive explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus. Our results identify and resolve an unstudied source of scale-dependence, allowing for more accurate prediction at scale.
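
To make the "sampled with replacement from pools of finite unique documents" setup concrete, the following minimal sketch computes how many distinct documents one expects to see after a given number of uniform draws. It is an illustration only, not the paper's experimental code: the function names `expected_unique` and `simulate_unique` are hypothetical, and uniform sampling over the pool is an assumption made here.

```python
import math

import numpy as np


def expected_unique(pool_size: int, num_draws: int) -> float:
    """Expected number of distinct documents after uniform draws with replacement.

    Uses the standard identity U * (1 - (1 - 1/U)^n), evaluated with
    log1p/expm1 for numerical stability when U is large.
    """
    if pool_size <= 0 or num_draws < 0:
        raise ValueError("pool_size must be positive and num_draws non-negative")
    if num_draws == 0:
        return 0.0
    if pool_size == 1:
        return 1.0
    return -pool_size * math.expm1(num_draws * math.log1p(-1.0 / pool_size))


def simulate_unique(pool_size: int, num_draws: int, seed: int = 0) -> int:
    """Monte Carlo check: draw document indices uniformly and count distinct ones."""
    rng = np.random.default_rng(seed)
    draws = rng.integers(0, pool_size, size=num_draws)
    return int(np.unique(draws).size)
```

For example, `expected_unique(1000, 1000)` is roughly 632, so a training run that draws as many documents as the pool contains sees only about 63% of the pool and repeats the rest. Under this assumed uniform model, the gap between draws and distinct documents widens as the number of draws grows relative to the pool.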

Paper Structure

This paper contains 38 sections, 1 theorem, 59 equations, 10 figures, 1 algorithm.

Key Result

Proposition 5.2

Under the correlation model (eq:theory_corr_model) and uniform sampling over $K$ classes, [...]. Equivalently, the averaged gradient behaves like an i.i.d. average with effective sample size [...].

Figures (10)

  • Figure 1: Semantic-preserving transformations yield more aligned gradients for larger/stronger models. We sample $N{=}1000$ FineWeb-Edu-Dedup documents and compute per-document gradients of normalized next-token cross-entropy (Eq. \ref{['eq:doc-loss']}) for each model. We report mean cosine similarity between (i) unrelated document pairs (negative baseline) and (ii) each document and its transformed counterpart (positives), including translations and light surface perturbations. Smaller/weaker models exhibit gradient similarity dominated by surface cues (language/casing), often failing to separate positives from negatives. As capability increases, positives become consistently more aligned than the negative baseline. Error bars show per-document standard deviation. Per-model-family results are in \ref{['fig:grad-sim-ind-family']}.
  • Figure 2: Semantic sensitivity emerges over training and is accelerated by scale. For a fixed model family, we compute AUC to detect whether a candidate gradient corresponds to a semantic-preserving transformation of the same document versus an unrelated document, with cosine similarity to the original document gradient as the score. Early in training, AUC remains near $0.5$ because gradients are dominated by surface-form features (language/casing). With additional optimizer steps, AUC increases, indicating that gradients increasingly reflect semantic content. Larger models reach a given AUC with fewer steps.
  • Figure 3: NN cosine similarity scaling deviates sharply at large corpus sizes. We embed 190M FineWeb-Edu-Dedup documents with EmbeddingGemma-300m and sample subsets without replacement, with sizes $N$ ranging from $10^4$ to $10^8$. For each $N$, we estimate the mean nearest-neighbor cosine similarity using FAISS. Dashed lines show best-fit power laws over the small-$N$ regime where the uniform/vMF null predicts $\mathbb{E}[\Delta_i]\propto N^{-2/d}$. Beyond a scale threshold, the empirical curve steepens (smaller gaps than predicted), indicating substantially more near neighbors than expected under isotropic baselines.
  • Figure 4: Tail collision rates accelerate with dataset size. For fixed thresholds $T$, we estimate the fraction of points with nearest-neighbor similarity $M_i \ge T$. These increase exponentially, as predicted under an isotropic baseline.
  • Figure 5: Nearest-neighbor cosine similarity scaling laws collapse an order of magnitude earlier for synthetic datasets. We embed the fully synthetic pretraining dataset Recycling-the-Web (nguyen2025recyclingwebmethodenhance) and find that the breakdown of power-law scaling observed in \ref{['fig:nn-gap-scaling']} occurs an order of magnitude earlier for synthetic data, suggesting that the diversity of synthetic pretraining datasets should be improved.
  • ...and 5 more figures
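
The nearest-neighbor statistics described in Figures 3 and 4 can be sketched on a small scale as follows. This is a hedged illustration rather than the authors' pipeline: it swaps FAISS for a brute-force NumPy similarity matrix (practical only for a few thousand points), assumes that the gap $\Delta_i$ is $1 - M_i$ where $M_i$ is the nearest-neighbor cosine similarity, and assumes all embedding rows are nonzero. The function names are hypothetical.

```python
import numpy as np


def nearest_neighbor_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Return M_i, the cosine similarity of each row to its nearest other row.

    Brute force O(N^2) memory and time; assumes every row has nonzero norm.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # exclude each point's match with itself
    return sims.max(axis=1)


def tail_collision_fraction(nn_sims: np.ndarray, threshold: float) -> float:
    """Fraction of points whose nearest-neighbor similarity satisfies M_i >= T."""
    return float(np.mean(nn_sims >= threshold))


def fit_power_law(sizes, mean_gaps):
    """Least-squares fit of mean_gap ~ prefactor * N**slope in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(mean_gaps), 1)
    return float(slope), float(np.exp(intercept))
```

A typical use would compute `np.mean(1.0 - nearest_neighbor_cosine(subset))` for subsets of increasing size, fit the slope on the small-$N$ points with `fit_power_law`, and then check whether the larger subsets fall below the extrapolated line.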

Theorems & Definitions (3)

  • Definition 5.1: Effective duplicates
  • Proposition 5.2: Saturation of independent training signal
  • Remark 5.3: Identifiability from mean NN cosine