Semantic Membership Inference Attack against Large Language Models

Hamid Mozaffari, Virendra J. Marathe

TL;DR

This work tackles privacy risks from memorization in large language models by introducing SMIA, a semantic Membership Inference Attack. SMIA leverages semantic perturbations of inputs and a learned binary attacker to detect whether a data point was part of a model's training data, outperforming prior MIAs on Pythia and GPT-Neo across verbatim and modified settings. The approach combines neighbor generation via a masking model, semantic embeddings, and loss-based signals to capture semantic memorization, achieving notable gains in AUC-ROC and robust performance when inputs are semantically altered. The findings have practical implications for privacy auditing, unlearning, and understanding the limits of data redaction in LLM training.
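The combination of semantic and loss-based signals described above can be illustrated with a minimal sketch. It assumes that the embeddings and target-model losses for an input and its neighbors have already been computed elsewhere. The names `cosine_distance`, `smia_features`, and `membership_score` are hypothetical. Cosine distance is only one assumed way to measure semantic change, and the logistic scorer is a simple stand-in for the learned attacker, not the authors' network.

```python
import numpy as np

def cosine_distance(a, b):
    """Assumed measure of semantic change between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def smia_features(x_emb, x_loss, neighbor_embs, neighbor_losses):
    """Build one row per neighbor: [semantic change, target-model loss change]."""
    return np.array([
        [cosine_distance(x_emb, n_emb), n_loss - x_loss]
        for n_emb, n_loss in zip(neighbor_embs, neighbor_losses)
    ])

def membership_score(features, weights, bias):
    """Average a logistic attacker's output over all neighbors (stand-in model)."""
    logits = features @ weights + bias
    return float(np.mean(1.0 / (1.0 + np.exp(-logits))))

# Toy values: one input with two neighbors (all numbers are illustrative).
feats = smia_features(
    x_emb=[1.0, 0.0], x_loss=2.0,
    neighbor_embs=[[1.0, 0.0], [0.0, 1.0]],
    neighbor_losses=[2.5, 3.0],
)
score = membership_score(feats, weights=np.array([0.0, 1.0]), bias=0.0)
```

In this toy setup a higher score would be read as "more likely a member". In practice the weights would come from training the attacker on known members and non-members.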

Abstract

Membership Inference Attacks (MIAs) determine whether a specific data point was included in the training set of a target model. In this paper, we introduce the Semantic Membership Inference Attack (SMIA), a novel approach that enhances MIA performance by leveraging the semantic content of inputs and their perturbations. SMIA trains a neural network to analyze the target model's behavior on perturbed inputs, effectively capturing variations in output probability distributions between members and non-members. We conduct comprehensive evaluations on the Pythia and GPT-Neo model families using the Wikipedia dataset. Our results show that SMIA significantly outperforms existing MIAs; for instance, SMIA achieves an AUC-ROC of 67.39% on Pythia-12B, compared to 58.90% by the second-best attack.
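
For readers unfamiliar with the metric, AUC-ROC can be read as the probability that a randomly chosen member receives a higher attack score than a randomly chosen non-member, with ties counted as half. The sketch below computes it by pairwise comparison. The function name `auc_roc` and the toy scores are illustrative assumptions, not values from the paper.

```python
def auc_roc(member_scores, nonmember_scores):
    """Fraction of (member, non-member) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Toy attack scores: three of the four member/non-member pairs are ordered correctly.
print(auc_roc([0.9, 0.4], [0.5, 0.1]))  # 0.75
```

Under this reading, an AUC-ROC of 50% corresponds to random guessing, so reported values are best compared against that baseline.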

Paper Structure

This paper contains 27 sections, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Our Semantic Membership Inference Attack (SMIA) inference pipeline.
  • Figure 2: Input features for our SMIA: semantic change and target model behaviour change for inputs and their neighbors.
  • Figure 3: Effect of training set size on the validation loss of SMIA over 20 epochs.
  • Figure 4: Similarity scores of generated neighbors for our training datasets, for members and non-members.
  • Figure 5: An example of an input sample and its different modifications.