LMD3: Language Model Data Density Dependence

John Kirchenbauer; Garrett Honke; Gowthami Somepalli; Jonas Geiping; Daphne Ippolito; Katherine Lee; Tom Goldstein; David Andre

TL;DR

The paper addresses how the density of training data around a test query in embedding space affects per-example performance of large language models. It proposes LMD3, a KDE-based framework, and validates it through leakage interventions and pretraining-scale analyses, enabled by DEANN for scalable density estimation. Key contributions include formalizing the LMD3 methodology, demonstrating that data density predicts per-sample accuracy and perplexity variance, and outlining practical applications for data attribution, contamination assessment, and targeted data augmentation. This density-centric lens offers a data-driven approach to instance- and group-level error analysis and benchmark integrity at large scales, informing data curation and evaluation practices.

Abstract

We develop a methodology for analyzing language model task performance at the individual example level based on training data density estimation. Experiments with paraphrasing as a controlled intervention on finetuning data demonstrate that increasing the support in the training distribution for specific test queries results in a measurable increase in density, which is also a significant predictor of the performance increase caused by the intervention. Experiments with pretraining data demonstrate that we can explain a significant fraction of the variance in model perplexity via density measurements. We conclude that our framework can provide statistical evidence of the dependence of a target model's predictions on subsets of its training data, and can more generally be used to characterize the support (or lack thereof) in the training data for a given test task.
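The density measurement described in the abstract can be illustrated with a minimal sketch. It assumes the training texts and test queries have already been embedded as fixed-size vectors (the embeddings here are hypothetical toy arrays), and it uses an exact brute-force Gaussian KDE rather than the scalable estimator used in the paper. The function name `gaussian_kde_at_queries` is illustrative, and the kernel's normalization constant is omitted, so the values are relative densities only.

```python
import numpy as np


def gaussian_kde_at_queries(train_emb, query_emb, bandwidth=0.5):
    """Relative Gaussian kernel density of each query w.r.t. the training set.

    Assumption: inputs are 2-D arrays of shape (n_points, dim). The Gaussian
    normalization constant is dropped, so outputs lie in (0, 1] and are only
    meaningful relative to one another.
    """
    train_emb = np.asarray(train_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)

    # Pairwise squared Euclidean distances, shape (n_queries, n_train).
    diff = query_emb[:, None, :] - train_emb[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Gaussian kernel value for every (query, training point) pair.
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))

    # Average kernel value over the training points gives the density estimate.
    return kernel.mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 8))           # toy "training corpus" embeddings
    near = train[:5] + 0.01                      # queries close to training points
    far = rng.normal(loc=10.0, size=(5, 8))      # queries far from the corpus
    print(gaussian_kde_at_queries(train, near))  # comparatively high density
    print(gaussian_kde_at_queries(train, far))   # density near zero
```

This brute-force version materializes an array of size n_queries × n_train × dim, so it is only practical for small samples; it is meant to show what is being measured, not how to compute it at pretraining scale.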

Paper Structure (43 sections, 2 equations, 15 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Memorization and Contamination
The "Data Attribution Hypothesis"
The "Similarity Hypothesis"
Preliminaries: Kernel Density Estimation
The LMD3 Methodology
Computing Embeddings
Computing the KDEs
Experiments
Models and Data
Paraphrasing Process
Controlled Experiment 1: Leakage to Increase Density, Finetuning Scale
In-the-Wild: In and Out-of-Distribution Queries, Pretraining Scale
Results
...and 28 more sections

Figures (15)

Figure 1: A system-level view of the LMD3 pipeline. The corpus of data used to train an LLM and a test set of queries are projected into a vector space using a neural embedding model. For each resulting query vector, a density estimate with respect to the training corpus is computed. The resulting density estimates can be used to infer the model's ability to respond to a question-like query, or simply to reproduce the tokens in the query sequence, based on whether the relative density is higher or lower at that point in sample space.
Figure 2: Left) To enable the aggregate interpretation of the paraphrasing experiments, we plot accuracy on leaked test questions as a function of the number of "effective epochs" for which the model has trained on the leaked questions. Right) We plot the same performance measure as a function of KDE, with a Gaussian kernel and bandwidth 0.5. The trend in accuracy under our density measure corresponds to the trend in accuracy under the known degree of leakage (effective epochs).
Figure 3: Moving from left to right in (a) and (b) shows the effect of increasing the number of paraphrases of each test question that are leaked into the training data. Experiments in (b) also include an exact copy of each leaked test question, while experiments in (a) do not. "Count" histograms) We plot the distributions of KDE values (Gaussian kernel and bandwidth 0.1) for the test queries that were leaked, exactly and/or via paraphrase, and for those not leaked, for each leakage intervention experiment. Accuracy bar charts) We show the corresponding accuracy breakdown for the leaked and non-leaked sets for each experiment. Overall, we find that increasing support for test questions by incorporating paraphrases into the training data increases performance on those test questions, and this increase is magnified by the addition of exact leaks of test questions. The addition of the exact copy of each question also makes the leaked and non-leaked question sets highly separable under our KDE measure, as demonstrated by the distinct concentration of "leaked Q" KDE values away from $0.0$ in (b).
Figure 4: Perplexity according to Pythia 6.9B for random samples from The Deduplicated Pile (ID) as a function of KDE with a Gaussian kernel and a bandwidth of 0.5, marginalized via equal-mass binning into 20 bins. Left) Query perplexity vs. the KDE with respect to a random sample of points in the corpus. Middle) Query perplexity vs. the KDE with respect to only the local neighborhood within the corpus. Right) Query perplexity vs. the average distance to the top k neighbors. The horizontal line denotes the average across all queries. The trend in query PPL as a function of the random component of the KDE is non-monotonic, and even weakly positive. When considering only the local region of highly similar samples for each query, however, there is a strong, clear negative trend in PPL as a function of density, as measured by the local KDE or a simple average over neighbor distances.
Figure 5: Perplexity according to Pythia 6.9B for questions from the MMLU test set (OoD) as a function of KDE with a Gaussian kernel and a bandwidth of 0.5, or of average distance to the k nearest neighbors, marginalized via equal-mass binning into 20 bins. Left) Question perplexity vs. the KDE with respect to only the local neighborhood within the corpus. Middle) Question perplexity vs. distance to the k nearest neighbors. Right) Response perplexity vs. distance to the k nearest neighbors. The horizontal line denotes the average across all queries. While the relationship between query perplexity and local KDE isn't particularly strong, there is a stronger trend as a function of simple neighbor distances. For response perplexity, we see a clear trend as a function of average neighbor distances.
...and 10 more figures
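
The captions for Figures 4 and 5 describe marginalizing perplexity "via equal mass binning into 20 bins". One plausible reading of that step is sketched below: sort the queries by their density value, split them into bins holding (nearly) the same number of queries, and average both quantities within each bin. The function name `equal_mass_bin_means` and the toy data are hypothetical, and the paper's exact binning procedure may differ.

```python
import numpy as np


def equal_mass_bin_means(x, y, n_bins=20):
    """Average y within equal-count bins of x.

    Assumption: x and y are 1-D arrays of equal length with at least
    n_bins elements, so that no bin is empty.
    Returns (mean x per bin, mean y per bin), ordered by increasing x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Sort queries by the binning variable (e.g. the KDE value).
    order = np.argsort(x, kind="stable")

    # Split the sorted arrays into n_bins chunks of (nearly) equal size.
    x_chunks = np.array_split(x[order], n_bins)
    y_chunks = np.array_split(y[order], n_bins)

    x_means = np.array([chunk.mean() for chunk in x_chunks])
    y_means = np.array([chunk.mean() for chunk in y_chunks])
    return x_means, y_means


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    density = rng.uniform(size=2000)                                # toy density values
    perplexity = 30.0 - 10.0 * density + rng.normal(size=2000)      # toy negative trend
    bin_density, bin_ppl = equal_mass_bin_means(density, perplexity)
    for d, p in zip(bin_density, bin_ppl):
        print(f"{d:.3f}\t{p:.2f}")
```

Equal-count bins keep the per-bin averages comparably noisy even when density values are heavily skewed, which fixed-width bins would not.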

Theorems & Definitions (1)

Definition 3.1
