High-Power Training Data Identification with Provable Statistical Guarantees

Zhenlong Liu; Hao Zeng; Weiran Huang; Hongxin Wei

TL;DR

This work introduces Provable Training Data Identification (PTDI), a distribution-free framework for identifying training data points with provable control over the false discovery rate ($\mathrm{FDR}$). PTDI leverages conformal p-values derived from a non-training calibration set, scales them by a data-usage proportion estimate, and applies the Benjamini–Hochberg procedure to obtain a data-dependent threshold, enabling strict $\mathrm{FDR}$ control and enhanced power. A key contribution is the subtraction estimator for $\pi_{\text{test}}$, which conservatively estimates the proportion of training data in the test set and improves power while maintaining guarantees; an additional adjusted-moment estimator further boosts performance when some confirmed members are known. Extensive experiments across LLMs and VLMs show PTDI consistently controls $\mathrm{FDR}$ below target levels and outperforms existing methods like KTD in several settings, demonstrating practical applicability in pre-training and fine-tuning scenarios with diverse scores and datasets.

Abstract

Identifying training data within large-scale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. Conventional approaches treat it as a simple binary classification task without statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict FDR control. Specifically, our method computes a p-value for each data point using a set of known unseen data, and then constructs a conservative estimator of the data-usage proportion of the test set, which allows us to scale these p-values. Our approach then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This entire procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs) and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power.
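The pipeline described above (conformal p-values from known unseen data, scaling, then a Benjamini–Hochberg threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that a higher score means a point is more likely to be training data, and that scaling means multiplying each p-value by an estimate of the non-member fraction of the test set, as in standard adaptive Benjamini–Hochberg. The function names and the `null_fraction` parameter are hypothetical, and the paper's subtraction and adjusted-moment estimators are not reproduced here: `null_fraction` is simply supplied by the caller.

```python
import numpy as np


def conformal_p_values(cal_scores, test_scores):
    """Conformal p-values for the null hypothesis 'this point is NOT training data'.

    cal_scores: scores of known non-training (calibration) points.
    test_scores: scores of the points under test.
    Assumption: a larger score is stronger evidence of membership.
    """
    cal = np.sort(np.asarray(cal_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = cal.size
    # Number of calibration scores that are >= each test score.
    n_ge = n - np.searchsorted(cal, test, side="left")
    return (1.0 + n_ge) / (n + 1.0)


def scaled_bh(p_values, alpha, null_fraction=1.0):
    """Benjamini-Hochberg step-up procedure on scaled p-values.

    null_fraction: assumed estimate of the share of non-members in the test
    set (hypothetical input; 1.0 recovers plain Benjamini-Hochberg).
    Returns the sorted indices of the points declared training data.
    """
    p = np.asarray(p_values, dtype=float) * null_fraction
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    # Step-up rule: reject everything up to the largest rank that passes.
    k = np.max(np.nonzero(below)[0])
    return np.sort(order[: k + 1])
```

For example, `scaled_bh(conformal_p_values(cal, test), alpha=0.1, null_fraction=0.6)` would return the indices of test points flagged as training data at a target FDR of 0.1, under the assumed non-member fraction of 0.6. A smaller `null_fraction` shrinks the p-values and so can only enlarge the selected set, which is the intuition behind the power gain; the FDR guarantee depends on that estimate being conservative.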
