High-Power Training Data Identification with Provable Statistical Guarantees
Zhenlong Liu, Hao Zeng, Weiran Huang, Hongxin Wei
TL;DR
This work introduces Provable Training Data Identification (PTDI), a distribution-free framework for identifying training data points with provable control over the false discovery rate ($\mathrm{FDR}$). PTDI leverages conformal p-values derived from a non-training calibration set, scales them by a data-usage proportion estimate, and applies the Benjamini–Hochberg procedure to obtain a data-dependent threshold, enabling strict $\mathrm{FDR}$ control and enhanced power. A key contribution is the subtraction estimator for $\pi_{\text{test}}$, which conservatively estimates the proportion of training data in the test set and improves power while maintaining guarantees; an additional adjusted-moment estimator further boosts performance when some confirmed members are known. Extensive experiments across LLMs and VLMs show PTDI consistently controls $\mathrm{FDR}$ below target levels and outperforms existing methods like KTD in several settings, demonstrating practical applicability in pre-training and fine-tuning scenarios with diverse scores and datasets.
Abstract
Identifying training data within large-scale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. Conventional approaches treat this as a simple binary classification task and offer no statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict FDR control. Specifically, our method computes a p-value for each data point using a set of known unseen data, and then constructs a conservative estimator for the data usage proportion of the test set, which allows us to scale these p-values. Our approach then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This entire procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs) and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power.
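The pipeline described above (p-values from known unseen data, scaling, then a data-dependent threshold) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the convention that a larger score means "more likely training data", and the `null_prop` parameter are all assumptions made for illustration. In particular, the paper's proportion estimator is not reproduced here: `null_prop` simply stands in for a supplied conservative estimate of the fraction of non-training points in the test set, used the way standard adaptive Benjamini–Hochberg procedures use such an estimate.

```python
from typing import List, Sequence


def conformal_p_values(calib_scores: Sequence[float],
                       test_scores: Sequence[float]) -> List[float]:
    """Conformal p-value of each test score against known non-member scores.

    Assumption: a larger score means "more likely to be training data",
    so a test point that outscores most calibration points gets a small p-value.
    """
    n = len(calib_scores)
    return [
        (1 + sum(1 for c in calib_scores if c >= s)) / (n + 1)
        for s in test_scores
    ]


def bh_select(p_values: Sequence[float], alpha: float,
              null_prop: float = 1.0) -> List[int]:
    """Benjamini-Hochberg step-up on p-values scaled by `null_prop`.

    `null_prop` is a hypothetical input: an estimate of the fraction of
    non-training points in the test set. With null_prop=1.0 this is plain BH.
    Returns the sorted indices of the points flagged as training data.
    """
    m = len(p_values)
    scaled = [min(1.0, null_prop * p) for p in p_values]
    order = sorted(range(m), key=lambda j: scaled[j])
    k = 0  # largest rank whose scaled p-value is below alpha * rank / m
    for rank, j in enumerate(order, start=1):
        if scaled[j] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Toy example with made-up p-values (not results from the paper).
toy_p = [0.03, 0.04, 0.9, 0.95]
plain = bh_select(toy_p, alpha=0.05)                    # selects nothing
scaled = bh_select(toy_p, alpha=0.05, null_prop=0.5)    # selects indices 0 and 1
```

In this toy case the scaling lowers the effective p-values enough for the step-up rule to admit the two smallest ones, which is the sense in which a good proportion estimate can raise power. The FDR guarantee itself depends on the estimator being conservative, which this sketch takes as given.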
