Semantic Membership Inference Attack against Large Language Models
Hamid Mozaffari, Virendra J. Marathe
TL;DR
This work tackles privacy risks from memorization in large language models by introducing SMIA, a semantic Membership Inference Attack. SMIA uses semantic perturbations of inputs and a learned binary attacker to detect whether a data point was part of a model's training data, and it outperforms prior MIAs on Pythia and GPT-Neo in both verbatim and modified settings. The approach combines neighbor generation via a masking model, semantic embeddings, and loss-based signals to capture semantic memorization. It improves AUC-ROC over prior attacks (for instance, 67.39% on Pythia-12B versus 58.90% for the second-best attack) and remains robust when inputs are semantically altered. The findings have practical implications for privacy auditing, unlearning, and understanding the limits of data redaction in LLM training.
Abstract
Membership Inference Attacks (MIAs) determine whether a specific data point was included in the training set of a target model. In this paper, we introduce the Semantic Membership Inference Attack (SMIA), a novel approach that enhances MIA performance by leveraging the semantic content of inputs and their perturbations. SMIA trains a neural network to analyze the target model's behavior on perturbed inputs, effectively capturing variations in output probability distributions between members and non-members. We conduct comprehensive evaluations on the Pythia and GPT-Neo model families using the Wikipedia dataset. Our results show that SMIA significantly outperforms existing MIAs; for instance, SMIA achieves an AUC-ROC of 67.39% on Pythia-12B, compared to 58.90% by the second-best attack.
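To make the pipeline described above more concrete, the following is a minimal sketch of how per-neighbor signals might be assembled and how an AUC-ROC figure is computed. It is not the authors' implementation. The names `neighbor_features`, `membership_score`, `loss_fn`, `embed_fn`, and `attacker` are hypothetical. The choice of Euclidean distance, the two-value feature per neighbor, and averaging the attacker's outputs are all assumptions made for illustration. In practice `loss_fn` would query the target model, `embed_fn` would be a semantic embedding model, and `attacker` would be the trained neural network.

```python
import math
from typing import Callable, List, Sequence, Tuple

Feature = Tuple[float, float]  # (semantic distance, loss change); an assumed feature layout


def neighbor_features(
    text: str,
    neighbors: Sequence[str],
    loss_fn: Callable[[str], float],             # hypothetical: target-model loss on a text
    embed_fn: Callable[[str], Sequence[float]],  # hypothetical: semantic embedding of a text
) -> List[Feature]:
    """Pair each neighbor's semantic distance from `text` with its change in loss."""
    base_loss = loss_fn(text)
    base_emb = embed_fn(text)
    feats = []
    for nb in neighbors:
        dist = math.dist(base_emb, embed_fn(nb))  # Euclidean distance (an assumption)
        feats.append((dist, loss_fn(nb) - base_loss))
    return feats


def membership_score(
    feats: Sequence[Feature],
    attacker: Callable[[float, float], float],   # hypothetical stand-in for the learned attacker
) -> float:
    """Average the attacker's per-neighbor outputs into a single score (assumed aggregation)."""
    return sum(attacker(d, dl) for d, dl in feats) / len(feats)


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (member, non-member) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise definition of AUC-ROC used here is equivalent to the area under the ROC curve and shows why the metric is threshold-free: it measures only how well the scores rank members above non-members. A value of 50% corresponds to random guessing, which is the baseline against which the reported 67.39% and 58.90% should be read.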
