Order of Magnitude Speedups for LLM Membership Inference

Rongting Zhang, Martin Bertran, Aaron Roth

TL;DR

This work tackles the high computational cost of shadow-model-based membership inference attacks (MIAs) on large language models (LLMs) by introducing an ensemble of small quantile regression models that calibrate MIA thresholds without reproducing full model training. By modeling the null (non-member) score distribution with $\mu(x)$ and $\sigma(x)$ and deriving a per-example threshold $q_{1-\alpha}(x)$, the approach enables effective MIAs with only a fraction of the compute required by LiRA. Empirical results on AG News, WikiText-103, and XSum show competitive or superior true positive rates at very low false positive rates with as little as roughly $6\%$ of LiRA's compute, and robust performance across model families (Pythia, OPT, Llama) and tokenizers. These findings support using calibrated, ensemble quantile regression as a practical privacy auditing tool for deployed LLMs, facilitating routine assessments of memorization risk and privacy leakage.
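The per-example threshold described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the null score distribution is Gaussian with location mu(x) and scale sigma(x), that each ensemble member outputs one (mu, sigma) pair for the example, that the pairs are combined by simple averaging, and that higher scores indicate membership. The function names `dynamic_threshold` and `is_member` are hypothetical.

```python
from statistics import NormalDist, fmean


def dynamic_threshold(mus, sigmas, alpha):
    """Return q_{1-alpha}(x) for one example x.

    mus, sigmas: per-ensemble-member predictions of the null-score mean
    and standard deviation for x (assumed inputs).
    alpha: target false positive rate, 0 < alpha < 1.
    """
    mu = fmean(mus)        # assumption: average the predicted means
    sigma = fmean(sigmas)  # assumption: average the predicted std devs
    # Under the Gaussian assumption, the (1 - alpha) quantile of the
    # null distribution is mu + sigma * Phi^{-1}(1 - alpha).
    z = NormalDist().inv_cdf(1.0 - alpha)
    return mu + sigma * z


def is_member(score, mus, sigmas, alpha):
    """Flag x as a training member if its score exceeds the threshold."""
    return score > dynamic_threshold(mus, sigmas, alpha)
```

Because the threshold depends on x, an example that is intrinsically easy for any model (high mu) needs a correspondingly higher score before it is flagged, which is what lets the attack target a fixed false positive rate without shadow models.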

Abstract

Large Language Models (LLMs) promise to revolutionize computing broadly, but their complexity and extensive training data also expose significant privacy vulnerabilities. One of the simplest privacy risks associated with LLMs is their susceptibility to membership inference attacks (MIAs), wherein an adversary aims to determine whether a specific data point was part of the model's training set. Although this is a known risk, state-of-the-art methodologies for MIAs rely on training multiple computationally costly shadow models, making risk evaluation prohibitive for large models. Here we adapt a recent line of work that uses quantile regression to mount membership inference attacks; we extend this work by proposing a low-cost MIA that leverages an ensemble of small quantile regression models to determine whether a document belongs to the model's training set. We demonstrate the effectiveness of this approach on fine-tuned LLMs of varying families (OPT, Pythia, Llama) and across multiple datasets. Across all scenarios we obtain comparable or improved accuracy compared to state-of-the-art shadow model approaches, with as little as 6% of their computation budget. We demonstrate increased effectiveness across multi-epoch trained target models, as well as robustness to architecture misspecification: we can mount an effective attack against a model using a different tokenizer and architecture, without requiring knowledge of the target model.

Paper Structure

This paper contains 27 sections, 5 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Comparing true positive rates vs false positive rates of our method with LiRA variants and simple score-function-based methods on WikiText-103, where the target model is Pythia-6.9b. LiRA* represents LiRA with a fixed variance estimate. LiRA results are obtained with 4 shadow models of varying sizes from the Pythia family. Results for our method are obtained with an ensemble of 5 quantile regression models fine-tuned from Pythia-160m.
  • Figure 2: True positive rates at 0.1% and 1% FPR on the three datasets, where the target model is Pythia-6.9b, with varying ensemble sizes of our method. Five independent runs were executed for each setting.
  • Figure 3: Distribution of the standard deviation of z-scores computed from five independent runs of our method with varying ensemble sizes.
  • Figure 4: True positive rates at 0.1% and 1% FPR on all datasets as a function of number of epochs of the target model (OPT-6.7b). MIA risk increases for all methods with additional fine-tuning epochs of the target model.
  • Figure 5: True positive rates at 0.1% and 1% FPR on the three datasets, where all target models are Pythia models of different sizes. LiRA results obtained using shadow models smaller than the target model are marked with empty circles.
  • ...and 4 more figures
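
Several captions above report the true positive rate at a fixed false positive rate (0.1% and 1% FPR). A minimal sketch of that metric is given here for orientation; it assumes higher scores indicate membership and picks the threshold empirically from held-out non-member scores. The function name `tpr_at_fpr` is hypothetical and this is not the paper's evaluation code.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """Fraction of members flagged when at most target_fpr of
    non-members are (wrongly) flagged."""
    neg = sorted(nonmember_scores, reverse=True)
    # Number of false positives we are allowed to make.
    k = int(target_fpr * len(neg))
    # Flag scores strictly above the (k+1)-th largest non-member score,
    # so no more than k non-members can exceed the threshold.
    threshold = neg[k] if k < len(neg) else float("-inf")
    true_positives = sum(s > threshold for s in member_scores)
    return true_positives / len(member_scores)
```

For example, `tpr_at_fpr(members, nonmembers, 0.001)` would correspond to the 0.1% FPR operating point, provided there are enough non-member scores (at least 1,000) for that rate to be resolvable.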