How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence

Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, Yixuan Li

TL;DR

Dataset contamination inflates LLM benchmark performance when pretraining data overlaps evaluation sets. The authors propose Kernel Divergence Score (KDS), which compares kernel similarity matrices of sample embeddings before and after fine-tuning, using an RBF kernel with bandwidth $\gamma$ and a normalizer $E$ to define $S(\mathcal{D}, \mathcal{M}) = - \frac{1}{E} \sum_{i,j} \left| \Phi(Z)_{ij} \log \frac{\Phi(Z)_{ij}}{\Phi(Z')_{ij}} \right|$. Empirically, KDS achieves near-perfect monotonicity with contamination rate $\lambda$ across multiple datasets and model families, outperforms baselines, and remains robust to kernel choice, bandwidth, and embedding location, with ablations underscoring the importance of fine-tuning signals. This kernel-based, model-information-driven approach provides a reliable, interpretable measure of dataset leakage that can guide benchmark curation and improve the reliability of generalization assessments. The work also discusses temporal-shift considerations and outlines practical extensions, including PU-learning ideas and kernel calibration for broader applicability.
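The score defined above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' released implementation: the function names are hypothetical, the RBF kernel is assumed to take the standard form $\exp(-\gamma \lVert z_i - z_j \rVert_2^2)$, the default `gamma=1.0` is arbitrary, and the normalizer $E$ is assumed here to be the sum of the entries of $\Phi(Z)$, since the summary only calls it a normalizer.

```python
import numpy as np


def rbf_kernel(Z, gamma):
    """Pairwise RBF similarities exp(-gamma * ||z_i - z_j||^2) for the rows of Z."""
    sq_norms = np.sum(Z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-gamma * sq_dists)


def kernel_divergence_score(Z, Z_ft, gamma=1.0):
    """Sketch of S = -(1/E) * sum_ij |K_ij * log(K_ij / K'_ij)|.

    Z    : (n, d) embeddings before fine-tuning
    Z_ft : (n, d) embeddings of the same samples after fine-tuning
    """
    # Unit-normalize rows; squared distances are then at most 4,
    # so every kernel entry stays strictly positive and the log is safe.
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Z_ft = Z_ft / np.linalg.norm(Z_ft, axis=1, keepdims=True)

    K = rbf_kernel(Z, gamma)
    K_ft = rbf_kernel(Z_ft, gamma)

    E = K.sum()  # ASSUMPTION: the exact normalizer is not given in this summary
    return -np.sum(np.abs(K * np.log(K / K_ft))) / E
```

Under this sign convention the score is never positive: it is zero when fine-tuning leaves all pairwise similarities unchanged and becomes more negative as they shift. In practice `Z` and `Z_ft` would be embeddings of the same benchmark samples extracted from the model before and after fine-tuning.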

Abstract

Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations. Measuring dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model's ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that evaluates dataset contamination by computing the divergence between the kernel similarity matrices of sample embeddings before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings. Code is released at https://github.com/deeplearning-wisc/kernel-divergence-score.

Paper Structure

This paper contains 35 sections, 17 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overview of the proposed Kernel Divergence Score (KDS) framework for measuring dataset contamination in large language models. The process involves extracting sample embeddings from the model before and after fine-tuning on the benchmark dataset $\mathcal{D}$, computing the kernel similarity matrix for each stage, and measuring the divergence between the two matrices $\Phi(Z)$ and $\Phi(Z')$. By capturing the changes of embeddings induced by fine-tuning, KDS provides a reliable and interpretable score to evaluate the level of dataset contamination.
  • Figure 2: Decomposition of the Kernel Divergence Score. Each component of the Kernel Divergence Score function is shown. $\Phi(\cdot)$ denotes the kernel similarity matrix, $Z$ and $Z'$ represent normalized sample embeddings before and after fine-tuning, and $\odot$ is the Hadamard product. Scores and embeddings are based on Llama-3.1-8B-Instruct (Dubey et al., 2024). (Left) The original kernel similarity matrix before fine-tuning. Diagonal values are zeroed for better visualization, because all diagonal values are 1 in RBF kernels. (Middle) Fine-tuning alters relationships among unseen samples more than those among seen samples. (Right) Combining the two panels enhances the distinction between seen and unseen samples, thereby enabling a more reliable measurement of contamination levels.
  • Figure 3: Trend of Kernel Divergence Scores on WikiMIA. The score increases monotonically with the contamination rate, and the standard deviation over 5 runs is low.
  • Figure 4: Scoring performance across embedding locations. Correlation coefficients are computed from embeddings taken at different layers of Mistral-7B-Instruct-v0.2 on WikiMIA.
  • Figure 5: Decomposition of the Kernel Divergence Score - full list.
  • ...and 2 more figures
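
The decomposition described in the Figure 2 caption can be written out explicitly. The following is a sketch, assuming the RBF kernel takes the standard form $\Phi(Z)_{ij} = \exp(-\gamma \lVert z_i - z_j \rVert_2^2)$, which this summary does not state; reading the first factor as the left panel and the second as the middle panel is an interpretation of the caption, not something the caption spells out.

```latex
\begin{aligned}
\log \frac{\Phi(Z)_{ij}}{\Phi(Z')_{ij}}
  &= -\gamma \left( \lVert z_i - z_j \rVert_2^2 - \lVert z'_i - z'_j \rVert_2^2 \right), \\
S(\mathcal{D}, \mathcal{M})
  &= -\frac{1}{E} \sum_{i,j} \Big[ \Phi(Z) \odot \big| \log \Phi(Z) - \log \Phi(Z') \big| \Big]_{ij}.
\end{aligned}
```

The second line follows because every RBF kernel entry is strictly positive, so $\Phi(Z)_{ij}$ can be pulled out of the absolute value. The log-ratio factor measures how much fine-tuning moved each pair of samples, and the Hadamard product with $\Phi(Z)$ weights that change by how similar the pair was originally.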