Exploring the limits of strong membership inference attacks on large language models

Jamie Hayes; Ilia Shumailov; Christopher A. Choquette-Choo; Matthew Jagielski; George Kaissis; Milad Nasr; Sahra Ghalebikesabi; Meenatchi Sundaram Mutu Selva Annamalai; Niloofar Mireshghallah; Igor Shilov; Matthieu Meeus; Yves-Alexandre de Montjoye; Katherine Lee; Franziska Boenisch; Adam Dziedzic; A. Feder Cooper

TL;DR

This paper examines the privacy risk posed by strong membership inference attacks (MIAs) on large pre-trained language models by scaling LiRA to GPT-2 architectures ranging from 10M to 1B parameters, with reference models trained on over 20B tokens from the C4 dataset. It shows that strong MIAs can beat random guessing but remain limited in practical settings (e.g., AUC below 0.7), and that many per-sample decisions are unstable enough, due to training randomness, to be statistically indistinguishable from a coin flip. The authors also find that the relationship between MIA success and other privacy metrics is less straightforward than prior work suggested: later training stages increase vulnerability, but this does not guarantee reliable inference for individual samples. Together, the results form a large-scale benchmark that bounds the effectiveness of current strong MIAs on contemporary LLMs and points to the need for more robust, sample-level privacy assessments and defenses.

Abstract

State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training references (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle, and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA, one of the strongest MIAs, to GPT-2 architectures ranging from 10M to 1B parameters, training references on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pre-trained LLMs, (2) their effectiveness remains limited (e.g., AUC<0.7) in practical settings. (3) Even when strong MIAs achieve better-than-random AUC, aggregate metrics can conceal substantial per-sample MIA decision instability: due to training randomness, many decisions are so unstable that they are statistically indistinguishable from a coin flip. Finally, (4) the relationship between MIA success and related LLM privacy metrics is not as straightforward as prior work has suggested.
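The abstract does not spell out how LiRA scores an example, so the following is only a minimal sketch, assuming the commonly described formulation: for each example, fit one Gaussian to a per-example statistic (such as a loss) from reference models that trained on it ("IN") and another from reference models that did not ("OUT"), then score the target model's statistic by the log-likelihood ratio. The function names `gaussian_logpdf`, `lira_score`, and `auc` are hypothetical, and nothing here reproduces the authors' actual pipeline or numbers.

```python
import math
from statistics import fmean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def lira_score(target_stat, in_stats, out_stats, min_sigma=1e-6):
    """Log-likelihood ratio of IN vs OUT for a single example.

    target_stat: statistic of the example under the target model.
    in_stats / out_stats: the same statistic under reference models that
    did / did not train on the example (assumed inputs, at least 2 each).
    A positive score favours "member".
    """
    in_mu, in_sigma = fmean(in_stats), max(stdev(in_stats), min_sigma)
    out_mu, out_sigma = fmean(out_stats), max(stdev(out_stats), min_sigma)
    return (gaussian_logpdf(target_stat, in_mu, in_sigma)
            - gaussian_logpdf(target_stat, out_mu, out_sigma))


def auc(member_scores, nonmember_scores):
    """AUC as the probability that a random member outscores a random
    non-member, counting ties as one half (Mann-Whitney form)."""
    wins = sum(
        1.0 if m > n else 0.5 if m == n else 0.0
        for m in member_scores
        for n in nonmember_scores
    )
    return wins / (len(member_scores) * len(nonmember_scores))
```

An AUC of 0.5 corresponds to random guessing, which is the baseline against which the reported AUC<0.7 should be read. The pairwise AUC above is quadratic in the number of examples; it is written for clarity rather than scale.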
