Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data

Jie Zhang, Debeshee Das, Gautam Kamath, Florian Tramèr

TL;DR

The paper argues that membership inference (MI) attacks cannot serve as proof that a dataset was used to train a large, production-scale model. A convincing proof must bound the attack's false positive rate under the null hypothesis that the model was not trained on the target data, and this null hypothesis cannot be sampled: the exact training set is unknown and the model cannot be (efficiently) retrained. The authors formalize the problem as a hypothesis test and show that the usual ways of collecting counterfactual data in real-world settings fail because of distribution shift and non-IID data. As alternatives, they propose three sound paradigms: injecting randomly sampled canaries and applying a rank-based test, watermarking data to leave auditable traces, and data extraction attacks that recover portions of the training data directly. These approaches are meant to demonstrate data usage while keeping the false positive rate under control. The work also identifies settings where MI remains appropriate (evaluating privacy defenses and auditing differential privacy) and underscores the need for auditable proofs in litigation and policy.

Abstract

We consider the problem of a training data proof, where a data creator or owner wants to demonstrate to a third party that some machine learning model was trained on their data. Training data proofs play a key role in recent lawsuits against foundation models trained on web-scale data. Many prior works suggest to instantiate training data proofs using membership inference attacks. We argue that this approach is fundamentally unsound: to provide convincing evidence, the data creator needs to demonstrate that their attack has a low false positive rate, i.e., that the attack's output is unlikely under the null hypothesis that the model was not trained on the target data. Yet, sampling from this null hypothesis is impossible, as we do not know the exact contents of the training set, nor can we (efficiently) retrain a large foundation model. We conclude by offering two paths forward, by showing that data extraction attacks and membership inference on special canary data can be used to create sound training data proofs.

Paper Structure

This paper contains 31 sections, 1 theorem, 16 equations, and 7 figures.

Key Result

Lemma 1

Assume that $x$ is sampled uniformly at random from $\mathcal{X}$, independently of the creation of the training set $D$ and of model training $f \sim \texttt{Train}(D)$. Then, the FPR in eq:fpr_rank satisfies:
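
As a hedged illustration of the lemma's setting (not the paper's own statement or code), the sketch below assumes a test that rejects the null hypothesis when the canary's rank among all candidates in $\mathcal{X}$ is at most $k$. If the canary is chosen uniformly and independently of the model's scores, and ties are broken deterministically, its rank is uniform on $\{1, \dots, |\mathcal{X}|\}$, so such a test rejects with probability $k/|\mathcal{X}|$. The function names and the Gaussian stand-in scores are hypothetical.

```python
import random


def rank_of(scores, idx):
    """1-based rank of scores[idx] (lower score = rank 1); ties broken by index."""
    target = (scores[idx], idx)
    return 1 + sum((s, i) < target for i, s in enumerate(scores))


def rank_test_rejects(scores, canary_idx, k):
    """Assumed test: reject the null when the canary ranks in the top k."""
    return rank_of(scores, canary_idx) <= k


def exact_fpr(scores, k):
    """Exact FPR for fixed scores: average over all equally likely canary choices."""
    n = len(scores)
    return sum(rank_test_rejects(scores, i, k) for i in range(n)) / n


def simulated_fpr(n, k, trials, seed=0):
    """Monte Carlo FPR when scores are drawn independently of the canary choice."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Null hypothesis stand-in: the model never saw the canary, so its
        # scores over the candidate set do not depend on which one was picked.
        scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
        canary_idx = rng.randrange(n)
        hits += rank_test_rejects(scores, canary_idx, k)
    return hits / trials
```

For example, `exact_fpr([0.3, 0.1, 0.1, 0.9, 0.5], k=2)` returns `0.4`, i.e. $k/|\mathcal{X}| = 2/5$, regardless of what the scores are. The point of the random canary is that this calculation needs no knowledge of the training set and no retraining.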

Figures (7)

  • Figure 1: In a training data proof, a data creator aims to convince a third party (e.g., a judge) that a machine learning model was trained on their data. A common proposal in the literature is to use membership inference attacks for this purpose.
  • Figure 2: If we try to estimate the FPR of a training data proof by collecting non-members after the model's cutoff date (e.g., as in [shi2023detecting, meeus2024did]), we estimate the model's behavior on a distribution that differs significantly from the true null hypothesis.
  • Figure 3: If we collect targets for training data proofs close to the model's cutoff date (as suggested in [meeus2024inherent]), we run the risk of focusing our efforts on targets that have no chance of being members, because some data sources could have been collected further in the past.
  • Figure 4: If we want to perform a training data proof by comparing the model's behavior on a piece of data to plausible counterfactuals [mainidi2024] (in this fictitious case, possible alternative names that J.K. Rowling could have sampled in lieu of "Harry Potter"), we need to ensure that the act of publishing the data did not causally impact other parts of the model's training data (that are not part of the claimed training data proof).
  • Figure 5: Injecting a specially crafted canary into a news article, e.g., a hidden message in the HTML code.
  • ...and 2 more figures
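
The canary injection described in the Figure 5 caption could look roughly like the following. This is a minimal sketch, not the authors' procedure: it assumes the canary is a uniformly random hex string and that it is hidden in an HTML comment placed before the closing `</body>` tag. The helper names `make_canary` and `inject_canary` are hypothetical.

```python
import secrets


def make_canary(n_bytes=16):
    """Sample a canary uniformly at random from all hex strings of 2 * n_bytes characters."""
    return secrets.token_hex(n_bytes)


def inject_canary(html, canary):
    """Hide the canary in an HTML comment just before the last </body> tag.

    If the page has no </body> tag, the comment is appended at the end.
    """
    comment = f"<!-- {canary} -->"
    pos = html.rfind("</body>")
    if pos == -1:
        return html + comment
    return html[:pos] + comment + html[pos:]


article = "<html><body><p>Some news article.</p></body></html>"
canary = make_canary()
published = inject_canary(article, canary)
```

The data creator would keep `canary` (and a record of how it was sampled) private until the test is run, since the false positive argument relies on the canary being chosen independently of the model's training.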

Theorems & Definitions (2)

  • Lemma 1
  • proof