Exploring the limits of strong membership inference attacks on large language models

Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Milad Nasr, Sahra Ghalebikesabi, Meenatchi Sundaram Mutu Selva Annamalai, Niloofar Mireshghallah, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye, Katherine Lee, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper

TL;DR

This paper studies how well strong membership inference attacks (MIAs) work on large pre-trained language models by scaling LiRA to GPT-2 architectures from 10M to 1B parameters, with reference models trained on over 20B tokens from the C4 dataset. It shows that strong MIAs can beat random guessing but remain limited in practical settings (e.g., AUC below 0.7), and that many per-sample decisions are so unstable under training randomness that they are statistically indistinguishable from a coin flip. The authors also find that the relationship between MIA success and other LLM privacy metrics is less straightforward than prior work suggested: later training stages increase vulnerability, but do not guarantee reliable inference for individual samples. Collectively, the work provides a large-scale benchmark that bounds the effectiveness of current strong MIAs on contemporary LLMs and motivates more robust, sample-level privacy assessments and defenses.

Abstract

State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training references (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle, and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA--one of the strongest MIAs--to GPT-2 architectures ranging from 10M to 1B parameters, training references on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pre-trained LLMs, (2) their effectiveness remains limited (e.g., AUC<0.7) in practical settings. (3) Even when strong MIAs achieve better-than-random AUC, aggregate metrics can conceal substantial per-sample MIA decision instability: due to training randomness, many decisions are so unstable that they are statistically indistinguishable from a coin flip. Finally, (4) the relationship between MIA success and related LLM privacy metrics is not as straightforward as prior work has suggested.
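
To make the attack concrete, here is a minimal sketch of a LiRA-style score and of ROC-AUC, in plain Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes each model yields one scalar statistic per sample (for example, a negative per-sequence loss), fits one Gaussian to the statistics from references trained with the sample (IN) and one to those trained without it (OUT), and scores the target by the log-likelihood ratio. The names `gaussian_logpdf`, `lira_score`, and `roc_auc` are hypothetical.

```python
import math
import statistics


def gaussian_logpdf(x, mean, std):
    """Log density of N(mean, std^2) at x; std is floored to avoid division by zero."""
    std = max(std, 1e-6)
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def lira_score(target_stat, in_stats, out_stats):
    """Log-likelihood ratio: higher means the sample looks more like a training member.

    target_stat: statistic of the sample under the attacked model (assumed scalar).
    in_stats / out_stats: the same statistic under references trained with / without
    the sample (each needs at least two values to estimate a standard deviation).
    """
    mu_in, sd_in = statistics.mean(in_stats), statistics.stdev(in_stats)
    mu_out, sd_out = statistics.mean(out_stats), statistics.stdev(out_stats)
    return gaussian_logpdf(target_stat, mu_in, sd_in) - gaussian_logpdf(target_stat, mu_out, sd_out)


def roc_auc(member_scores, nonmember_scores):
    """AUC as the probability that a random member outscores a random non-member."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(member_scores) * len(nonmember_scores))


if __name__ == "__main__":
    # Toy numbers (made up): the target statistic sits inside the IN distribution.
    score = lira_score(-2.0, in_stats=[-2.1, -1.9, -2.0, -2.2], out_stats=[-3.1, -2.9, -3.0, -3.2])
    print("LiRA score:", score)
    print("AUC:", roc_auc([0.9, 0.4], [0.5, 0.1]))
```

An AUC of 0.5 corresponds to random guessing, which is the baseline against which the AUC values reported for this paper should be read. The quadratic pairwise loop in `roc_auc` is chosen for clarity; a rank-based computation would be preferable for large score sets.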

Paper Structure

This paper contains 90 sections, 107 equations, 32 figures, 7 tables.

Figures (32)

  • Figure 1: LiRA with different references. We attack a $140$M model trained on ${\approx}7$M samples. As references increase, LiRA's performance improves (measured with $\mathrm{ROC{\text{-}}AUC}$). However, there are diminishing returns: $\mathrm{AUC}$ is effectively unchanged from $128$ to $256$ IN references.
  • Figure 2: MIA vulnerability for compute-optimally trained models. We show attacks on $6$ models of different sizes under Chinchilla-optimal conditions for 1 epoch, using $128$ references. (a) $\mathrm{ROC}$ curves demonstrate varying MIA susceptibility for $10$M ($\mathrm{AUC}{=}0.592$), $85$M ($\mathrm{AUC}{=}0.699$), $302$M ($\mathrm{AUC}{=}0.689$), $489$M ($\mathrm{AUC}{=}0.547$), $604$M ($\mathrm{AUC}{=}0.654$) and $1018$M ($\mathrm{AUC}{=}0.553$). The $85$M and $302$M models show the highest vulnerability, indicating that increasing model size does not uniformly decrease MIA risk in this setting. (b) How $\mathrm{TPR}$ for each fixed $\mathrm{FPR}$ varies by model size.
  • Figure 3: Studying the effect of varying epochs. (a) We compare attacking a $44$M model trained on the whole Chinchilla-optimal dataset in $1$ epoch ($\mathrm{AUC}{=}0.620$ after $1$ of $1$ epoch) to training for $2$ epochs on only half of the dataset ($\mathrm{AUC}{=}0.744$ after $2$ of $2$ epochs). (b) We attack a $140$M model trained on the whole Chinchilla-optimal dataset for $10$ epochs. $\mathrm{AUC}$ increases with more epochs.
  • Figure 4: Varying sizes of training dataset and model ($1$ epoch). (a) We attack $140$M models trained on different-size datasets ($50$K to $10$M samples). MIA success does not monotonically increase with dataset size. (b) We attack different-size models trained on a fixed dataset size (${\approx}8.3$M samples), and plot how $\mathrm{TPR}$ varies at fixed $\mathrm{FPR}$. MIA success monotonically increases with model size.
  • Figure 5: Visualizing decision instability. We train $B{=}127$ targets for the $302$M model on $2^{19}$ samples, and one set of $128$ references to use for all attacks. LiRA achieves stable mean $\mathrm{AUC}{=}0.752\pm0.007$, yet many per-sample decisions behave like coin flips. (left) Share of coin-flip-like decisions across $\mathrm{FPR}$ ($\log$-scale for small $\mathrm{FPR}$; $\widehat{\mathrm{flip}}_{\eta, B}{\gtrsim}0.487$, the $\alpha{=}0.05$ cutoff, see Appendix \ref{app:sec:instability:flip:arbitrary}). Members exhibit more coin-flip-like decisions than non-members. (right) A representative unstable member: IN/OUT distributions overlap; the $B$ target decisions flip because ${\bm{x}}$'s seed-specific scores lie near the respective seed-specific thresholds (Appendix \ref{app:sec:instability:flip:results}).
  • ...and 27 more figures
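
Two quantities recur in the captions above: the TPR at a fixed FPR, and whether a sample's decisions across independently seeded targets look like coin flips. The sketch below illustrates both in plain Python under simplifying assumptions. The threshold is set per target from that target's non-member scores, and "coin-flip-like" is tested with an exact two-sided binomial test of the member-decision rate against 0.5. That test is a stand-in chosen for illustration; it is not the paper's $\widehat{\mathrm{flip}}_{\eta, B}$ estimator, which is defined in the appendix and not reproduced here. All function names are hypothetical.

```python
import math


def tpr_at_fpr(member_scores, nonmember_scores, fpr):
    """Return (TPR, threshold) where the threshold admits at most fpr of non-members.

    A sample is predicted "member" when its score is strictly above the threshold.
    """
    ranked = sorted(nonmember_scores, reverse=True)
    k = int(fpr * len(ranked))  # number of false positives we are allowed
    threshold = ranked[k] if k < len(ranked) else float("-inf")
    tpr = sum(s > threshold for s in member_scores) / len(member_scores)
    return tpr, threshold


def binom_two_sided_p(k, n):
    """Exact two-sided p-value for H0: success probability = 0.5.

    Binomial(n, 0.5) is symmetric, so doubling the smaller tail is exact.
    """
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def coin_flip_like(decisions, alpha=0.05):
    """True if the member/non-member decisions for one sample (one per seeded
    target) cannot be distinguished from fair coin flips at level alpha."""
    n = len(decisions)
    k = sum(bool(d) for d in decisions)
    return binom_two_sided_p(k, n) >= alpha


if __name__ == "__main__":
    # Toy scores (made up) for a single target model.
    tpr, thr = tpr_at_fpr([0.9, 0.8, 0.3, 0.2], [0.7, 0.5, 0.4, 0.1], fpr=0.25)
    print("TPR:", tpr, "threshold:", thr)
    # One sample judged by B = 127 targets: 64 "member" vs 63 "non-member" votes.
    print(coin_flip_like([True] * 64 + [False] * 63))
    # A sample that every target labels "member" is a stable decision.
    print(coin_flip_like([True] * 127))
```

Because each seeded target gets its own threshold, a sample whose score sits close to those thresholds can be labeled differently from seed to seed even when the aggregate AUC barely moves, which is the effect Figure 5 visualizes.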