Beyond Surface-Level Similarity: Hierarchical Contamination Detection for Synthetic Training Data in Foundation Models
Sushant Mehta
TL;DR
This work tackles benchmark contamination in synthetic data used to train foundation models by proposing a four-level hierarchical detection framework that targets token-level, semantic-level, reasoning-pattern, and performance-cliff signals. Through controlled experiments on MMLU, GSM8K, and HumanEval, the approach uncovers semantic contamination that prior methods miss and delivers a mean F1 improvement of $26.5\%$ over state-of-the-art baselines. The framework combines Min-K% Prob at the token level, embedding clustering and distributional analysis for semantic signals, chain-of-thought tracing for reasoning patterns, and paraphrase-based performance cliffs to validate contamination. The authors provide practical guidance for audit pipelines, discuss limitations and future work, and argue that implementing such detection is essential for responsible and trustworthy use of synthetic data in large-scale foundation-model training.
Abstract
Synthetic data has become essential for training foundation models, yet benchmark contamination threatens evaluation integrity. Although existing detection methods identify token-level overlap, they fail to detect semantic-level contamination, where synthetic data conceptually resembles benchmarks without lexical overlap. This gap is critical because foundation models increasingly train on synthetic data that may implicitly encode benchmark knowledge. We propose a hierarchical contamination detection framework operating at four levels: token-level, semantic-level, reasoning-pattern, and performance-cliff detection. Through controlled experiments on MMLU, GSM8K, and HumanEval, we demonstrate that semantic-level contamination evades existing methods (F1 = 0.17 to 0.49) but is effectively detected by our hierarchical approach (F1 = 0.76), with an average improvement of 26.5\% over state-of-the-art baselines. Our framework provides practitioners with practical tools for audit pipelines and enables responsible deployment of synthetic training data.
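To make the token-level stage concrete, the following is a minimal sketch of Min-K% Prob scoring, the token-level signal named in the TL;DR. It assumes per-token log-probabilities have already been obtained from the model under audit. The function names, the choice of k, and the decision threshold are illustrative assumptions, not values taken from this paper.

```python
import math
from typing import Sequence


def min_k_prob(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of least likely tokens.

    A higher (less negative) score means even the hardest tokens were
    predicted confidently, which is the signal Min-K% Prob associates
    with text the model may have seen during training.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Always keep at least one token so short sequences still get a score.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def flag_contaminated(
    token_logprobs: Sequence[float],
    k: float = 0.2,
    threshold: float = -4.0,  # hypothetical cutoff; calibrate on held-out data
) -> bool:
    """Flag a sample whose Min-K% score exceeds the (assumed) threshold."""
    return min_k_prob(token_logprobs, k) > threshold


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    likely_seen = [-0.1, -0.2, -0.3, -0.5]
    likely_unseen = [-0.1, -6.0, -8.0, -0.2]
    print(flag_contaminated(likely_seen, k=0.5))
    print(flag_contaminated(likely_unseen, k=0.5))
```

A score of this kind depends only on token likelihoods, so a synthetic sample that paraphrases a benchmark item can pass it unflagged. That is the gap the semantic-level, reasoning-pattern, and performance-cliff stages are meant to cover.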
