LLM Dataset Inference: Did you train on my dataset?

Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic

TL;DR

The paper critiques prior membership inference attacks (MIAs) on LLMs, showing that their apparent success is driven by temporal distribution shifts rather than true data membership. It reframes data attribution as dataset inference: multiple MIAs are combined, and a statistical test decides whether a dataset was used in training. In experiments on the Pythia models trained on the Pile, the authors show that dataset inference distinguishes train from validation splits with p-values below 0.1, and produces no false positives when two validation subsets are compared. They also argue for IID setups, multiple distributions, and careful false-positive analysis, and outline an operational framework involving an arbiter for real-world copyright disputes. The work provides a practically viable tool for dataset attribution and highlights the limitations of per-example MIAs in the wild.

Abstract

The proliferation of large language models (LLMs) in the real world has come with a rise in copyright cases against companies for training their models on unlicensed data from the internet. Recent works have presented methods to identify if individual text sequences were members of the model's training data, known as membership inference attacks (MIAs). We demonstrate that the apparent success of these MIAs is confounded by selecting non-members (text sequences not used for training) belonging to a different distribution from the members (e.g., temporally shifted recent Wikipedia articles compared with ones used to train the model). This distribution shift makes membership inference appear successful. However, most MIA methods perform no better than random guessing when discriminating between members and non-members from the same distribution (e.g., in this case, the same period of time). Even when MIAs work, we find that different MIAs succeed at inferring membership of samples from different distributions. Instead, we propose a new dataset inference method to accurately identify the datasets used to train large language models. This paradigm sits realistically in the modern-day copyright landscape, where authors claim that an LLM is trained over multiple documents (such as a book) written by them, rather than one particular paragraph. While dataset inference shares many of the challenges of membership inference, we solve it by selectively combining the MIAs that provide positive signal for a given distribution, and aggregating them to perform a statistical test on a given dataset. Our approach successfully distinguishes the train and test sets of different subsets of the Pile with statistically significant p-values < 0.1, without any false positives.
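As a rough illustration of what "no better than random guessing" means for a per-example MIA, the sketch below scores sequences with Min-k% Prob (the MIA analyzed in Figure 2) and evaluates the scores with AUC. The function names, the default `k=0.2`, and the toy log-probabilities are assumptions made for this example; in practice the token log-probabilities would come from the suspect LLM.

```python
def min_k_prob(token_logprobs, k=0.2):
    """Min-k% Prob score: mean of the lowest k fraction of token log-probs.

    Higher (less negative) scores suggest the model finds even its least
    likely tokens unsurprising, which is read as a hint of membership.
    """
    n = max(1, int(len(token_logprobs) * k))  # always keep at least one token
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win, so identical score distributions give 0.5,
    which corresponds to random guessing.
    """
    wins = 0.0
    for m in member_scores:
        for s in nonmember_scores:
            if m > s:
                wins += 1.0
            elif m == s:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Toy data (hypothetical): per-token log-probs for a single sequence.
example_score = min_k_prob([-0.1, -0.2, -3.0, -0.5, -4.0], k=0.4)

# When members and non-members have the same score distribution, AUC is 0.5.
iid_auc = auc([-1.0, -2.0, -3.0], [-1.0, -2.0, -3.0])
```

An AUC near 0.5 on members and non-members drawn from the same distribution is the failure mode the abstract describes; an AUC well above 0.5 that appears only when non-members are temporally shifted would reflect the shift rather than membership.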

Paper Structure (46 sections, 2 equations, 11 figures)

Figures (11)

  • Figure 1: LLM Dataset Inference. Stage 0: Victim approaches an LLM provider. The victim's data consists of the suspect and validation (Val) sets. The victim claims that the suspect set of data points was potentially used to train the LLM. The validation set is private to the victim, such as unpublished data (e.g., drafts of articles, blog posts, or books) from the same distribution as the suspect set. Both sets are divided into non-overlapping splits (partitions) A and B. Stage 1: Aggregate Features with MIAs. The A splits from the suspect and validation sets are passed through the LLM to obtain their features, which are scores generated from various MIAs for LLMs. Stage 2: Learn Correlations (between features and their membership status). We train a linear model on the extracted features, assigning label 0 (denoting potential members of the LLM) to the suspect features and label 1 (representing non-members) to the validation features. The goal is to identify useful MIAs. Stage 3: Perform Dataset Inference. We take the B splits of the suspect and validation sets and (i) perform MIAs on them for the suspect LLM to obtain features, (ii) obtain an aggregated confidence score using the previously trained linear model, and (iii) apply a statistical t-test to the obtained scores. For suspect data points that are members, the confidence scores are significantly closer to 0 than for non-members.
  • Figure 2: Comparative analysis of Min-k% Prob (Shi et al., 2024). We show (a) the performance across different model sizes and (b) the observed reversal effect. The method performs close to a random guess on non-members from the Pile validation sets.
  • Figure 3: Performance of various MIAs on different subsets of the Pile dataset. We report 6 different MIAs, chosen as the best-performing ones across categories such as reference-based and perturbation-based methods (described in the paper's section on MIA methods). An effective MIA must have an AUC much greater than 0.5. A few methods meet this criterion on specific datasets, but the success is not consistent across datasets.
  • Figure 4: p-values of dataset inference. Applying dataset inference to Pythia-12b models with 1000 data points, we correctly distinguish the train and validation splits of the Pile with very low p-values (always below 0.1). When checking for false positives by comparing two validation subsets, we observe a p-value higher than 0.1 in all cases, indicating no false positives.
  • Figure 5: Ablation study for dataset inference. We analyze which features, based on or derived from previous membership inference methods, increase the success of dataset inference. (a) Our results indicate that no single feature contributes consistently, so we need a linear model to selectively aggregate their impact on the final outputs of dataset inference. (b) Given the selected features, we consider different ways to pre-process them before building the classifier. The proposed method (denoted as Removal (Norm.)) removes outliers and normalizes the feature values. (c) We evaluate the selected and pre-processed features using suspect sets that come from the validation data. We do not observe any false positives, as shown in the last row in (c).
  • ...and 6 more figures
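
Stages 1 to 3 of the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the MIA features are assumed to be precomputed (here they are synthetic Gaussian values), the linear model is fit by plain gradient descent on squared error, and the one-sided p-value uses a normal approximation to the Welch t-statistic, which is reasonable only for large splits. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def fit_linear(features, labels, lr=0.05, epochs=500):
    """Stage 2: least-squares linear model fit by batch gradient descent."""
    d, n = len(features[0]), len(features)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = b + sum(wi * xi for wi, xi in zip(w, x)) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def predict(model, features):
    """Aggregated confidence score per example (near 0 suggests a member)."""
    w, b = model
    return [b + sum(wi * xi for wi, xi in zip(w, x)) for x in features]


def one_sided_p_value(suspect_scores, val_scores):
    """Stage 3: test H1 'suspect scores are lower than validation scores'.

    Welch statistic with a normal approximation to the t-distribution
    (an assumption of this sketch).
    """
    se = math.sqrt(variance(suspect_scores) / len(suspect_scores)
                   + variance(val_scores) / len(val_scores))
    t = (mean(suspect_scores) - mean(val_scores)) / se
    return NormalDist().cdf(t)


def dataset_inference(suspect_a, val_a, suspect_b, val_b):
    """Train on the A splits (suspect=0, validation=1), test on the B splits."""
    labels = [0.0] * len(suspect_a) + [1.0] * len(val_a)
    model = fit_linear(suspect_a + val_a, labels)
    return one_sided_p_value(predict(model, suspect_b), predict(model, val_b))


def synthetic_features(n, shift, rng):
    """Stand-in for Stage 1: feature 0 carries signal, feature 1 is noise."""
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


rng = random.Random(0)
# Suspect set behaves like training data: its first feature is shifted.
p_member = dataset_inference(
    synthetic_features(200, -1.0, rng), synthetic_features(200, 0.0, rng),
    synthetic_features(200, -1.0, rng), synthetic_features(200, 0.0, rng))
# Suspect set is just another validation subset: no shift anywhere.
p_null = dataset_inference(
    synthetic_features(200, 0.0, rng), synthetic_features(200, 0.0, rng),
    synthetic_features(200, 0.0, rng), synthetic_features(200, 0.0, rng))
```

Splitting into A and B matters: the linear model is fit on the A splits only, so the test on the B splits is not biased by features the model has already seen. A p-value below the chosen threshold (0.1 in the captions above) would be read as evidence that the suspect set was used in training.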