Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration

Wenjie Fu; Huandong Wang; Chen Gao; Guanghua Liu; Yong Li; Tao Jiang

TL;DR

The paper investigates the privacy risk that membership inference attacks (MIAs) pose to fine-tuned large language models, and identifies two limitations of prior methods: they depend on a high-quality reference dataset, or they depend on the target model overfitting. It introduces SPV-MIA, a self-calibrated MIA that fine-tunes a reference model on text generated by prompting the target LLM itself, and that uses a memorization-based probabilistic variation signal. The two components are practical difficulty calibration (PDC) and probabilistic variation assessment (PVA). Empirical results across four open-source LLMs and three domains show substantial improvements in attack effectiveness (average AUC ~0.92) over seven baselines, along with analyses of reference data quality, robustness, and defenses such as DP-SGD. The work highlights realistic privacy risks in fine-tuned LLM deployment and provides a framework for evaluating and strengthening model privacy in practical settings.

Abstract

Membership Inference Attacks (MIAs) aim to infer whether a target data record was used to train a model. Existing MIAs designed for large language models (LLMs) fall into two types: reference-free and reference-based attacks. Although reference-based attacks appear to achieve promising performance by calibrating the probability measured on the target model against reference models, this illusion of privacy risk depends heavily on a reference dataset that closely resembles the training set. Both types of attack rest on the hypothesis that training records consistently have a higher probability of being sampled. However, this hypothesis relies heavily on the target model overfitting, which is mitigated by multiple regularization methods and by the generalization of LLMs. As a result, existing MIAs suffer from high false-positive rates in practical scenarios. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, we introduce a self-prompt approach, which constructs the dataset used to fine-tune the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs. Furthermore, we introduce probabilistic variation, a more reliable membership signal based on LLM memorization rather than overfitting, from which we rediscover the neighbour attack with theoretical grounding. Comprehensive evaluation on three datasets and four exemplary LLMs shows that SPV-MIA raises the AUC of MIAs from 0.7 to 0.9. Our code and dataset are available at: https://github.com/tsinghua-fib-lab/NeurIPS2024_SPV-MIA
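
To make the idea of a reference-calibrated probabilistic variation signal more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each model is exposed as a function returning a log-probability for a text, that paraphrases come in symmetric pairs, and that variation is measured as the average neighbour log-probability minus the log-probability of the text itself. The names `probabilistic_variation` and `calibrated_score`, the sign convention, and the toy lookup tables are all illustrative assumptions.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a model is represented by a function mapping text -> log-probability.
LogProb = Callable[[str], float]


def probabilistic_variation(
    log_prob: LogProb,
    text: str,
    paraphrase_pairs: Sequence[Tuple[str, str]],
) -> float:
    """Average neighbour log-probability minus the log-probability of `text`.

    A strongly negative value means `text` sits on a sharp local peak of the
    model's distribution, which is read here as a sign of memorization.
    """
    if not paraphrase_pairs:
        raise ValueError("at least one paraphrase pair is required")
    center = log_prob(text)
    total = 0.0
    for plus, minus in paraphrase_pairs:
        # Symmetric pair: average the two neighbours, then compare to the center.
        total += 0.5 * (log_prob(plus) + log_prob(minus)) - center
    return total / len(paraphrase_pairs)


def calibrated_score(
    target_log_prob: LogProb,
    reference_log_prob: LogProb,
    text: str,
    paraphrase_pairs: Sequence[Tuple[str, str]],
) -> float:
    """Reference variation minus target variation (illustrative sign convention).

    Higher values suggest the target model is more sharply peaked at `text`
    than the reference model, i.e. the record is more likely a member.
    """
    target_var = probabilistic_variation(target_log_prob, text, paraphrase_pairs)
    reference_var = probabilistic_variation(reference_log_prob, text, paraphrase_pairs)
    return reference_var - target_var


# Toy log-probability tables standing in for real models (made-up numbers).
TARGET = {"m": -1.0, "m+": -3.0, "m-": -3.0, "n": -2.5, "n+": -2.6, "n-": -2.4}
REFERENCE = {"m": -2.5, "m+": -2.7, "m-": -2.7, "n": -2.5, "n+": -2.6, "n-": -2.4}

member_score = calibrated_score(
    TARGET.__getitem__, REFERENCE.__getitem__, "m", [("m+", "m-")]
)
nonmember_score = calibrated_score(
    TARGET.__getitem__, REFERENCE.__getitem__, "n", [("n+", "n-")]
)
```

In this toy setup the record "m" is sharply peaked under the target table but not under the reference table, so it receives the larger score. In a real attack the log-probabilities would come from the fine-tuned target LLM and from a reference model fine-tuned on self-prompted text, and a threshold on the score would decide membership.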

Paper Structure (33 sections, 11 equations, 7 figures, 12 tables, 2 algorithms)

Introduction
Related Works
Preliminaries
Causal Language Models
Threat Model
Membership Inference Attack via Self-calibrated Probabilistic Variation
General Paradigm
Practical Difficulty Calibration (PDC) via Self-prompt Reference Model
Probabilistic Variation Assessment (PVA) via Symmetrical Paraphrasing
Experiments
Experimental Setup
Overall Performance
How MIAs Rely on Reference Dataset Quality
The Robustness of SPV-MIA in Practical Scenarios
Defending against SPV-MIAs
...and 18 more sections

Figures (7)

Figure 1: The attack performance of the reference-based MIA (LiRA [mireshghallah2022quantifying, mireshghallah2022empirical, ye2022enhanced, carlini2022quantifying]) and the reference-free MIA (LOSS Attack [yeom2018privacy]) is unsatisfactory against LLMs in practical scenarios, where LLMs are in the memorization stage and only a domain-specific dataset is available. (a) The reference-based MIA suffers a catastrophic drop in performance when the similarity between the reference and training datasets declines. (b) Existing MIAs are unable to expose privacy leakage in LLMs that only exhibit memorization, an inevitable phenomenon that occurs much earlier than overfitting and persists throughout almost the entire training phase [mireshghallah2022empirical, zhang2023counterfactual, tirumala2022memorization].
Figure 2: The overall workflow of SPV-MIA, which includes probabilistic calibration via a self-prompt reference model and probabilistic variation assessment via a paraphrasing model.
Figure 3: Performance of the reference-based MIA on LLaMA with different reference datasets.
Figure 4: Performance of SPV-MIA on LLaMA with different prompt text sources.
Figure 5: Performance of SPV-MIA on LLaMA with different numbers of queries to the target model and different prompt text lengths.
...and 2 more figures
