Membership Inference Attacks for Unseen Classes

Pratiksha Thaker, Neil Kale, Zhiwei Steven Wu, Virginia Smith

TL;DR

This work defines the unseen-class data-access model for membership inference attacks (MIAs), in which the attacker has no samples from certain target classes when training the attack, a scenario common in AI-safety auditing. It shows that shadow-model-based MIAs fail catastrophically under this constraint and proposes quantile regression attacks as a robust alternative that generalizes to unseen classes. The authors provide empirical evidence across image, text, and tabular domains, showing substantial gains over shadow models (up to 11× higher TPR at low FPR), along with a supporting theoretical model of transferability for quantile predictors. The findings highlight a critical failure mode in existing MIAs and offer a practical, scalable approach for privacy auditing in safety-critical contexts.

Abstract

The state of the art for membership inference attacks on machine learning models is a class of attacks based on shadow models that mimic the behavior of the target model on subsets of held-out nonmember data. However, we find that this class of attacks is fundamentally limited because of a key assumption -- that the shadow models can replicate the target model's behavior on the distribution of interest. As a result, we show that attacks relying on shadow models can fail catastrophically on critical AI safety applications where data access is restricted due to legal, ethical, or logistical constraints, so that the shadow models have no reasonable signal on the query examples. Although this problem seems intractable within the shadow model paradigm, we find that quantile regression attacks are a promising approach in this setting, as these models learn features of member examples that can generalize to unseen classes. We demonstrate this both empirically and theoretically, showing that quantile regression attacks achieve up to 11x the TPR of shadow model-based approaches in practice, and providing a theoretical model that outlines the generalization properties required for this approach to succeed. Our work identifies an important failure mode in existing MIAs and provides a cautionary tale for practitioners who aim to directly use existing tools for real-world applications of AI safety.

Paper Structure

This paper contains 35 sections, 1 theorem, 16 equations, 14 figures, 2 tables.

Key Result

Theorem 5.3

Let $P$ and $Q$ be distributions over $(x, s)$, and let $\phi: \mathcal{X} \to \mathbb{R}^d$ be a fixed feature map. Suppose we learn a linear quantile predictor $q_\alpha(x) = \langle \phi(x), w^* \rangle$ by minimizing the expected pinball loss under $P$: Assume that the density ratio between $Q$ and $P$ satisfies: Then the learned predictor $q_\alpha$ is calibrated under distribution $Q$ at q
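
To make the setup in the theorem concrete, the following is a minimal sketch of a linear quantile predictor fit with the pinball loss and then used as a per-example membership threshold. It is not the authors' implementation. The function names, the feature map `[1, x]`, the synthetic scores, and the use of plain subgradient descent are all assumptions made for illustration. The sketch also assumes that a higher score means "more member-like" and that the predictor is fit only on known nonmembers, so the target false positive rate is roughly `1 - alpha`. The pinball loss used here is the standard one, $\ell_\alpha(q, s) = \max(\alpha (s - q), (\alpha - 1)(s - q))$.

```python
import numpy as np

def pinball_loss(pred, target, alpha):
    """Mean pinball (quantile) loss at level alpha."""
    diff = target - pred
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))

def fit_linear_quantile(features, scores, alpha, lr=0.05, steps=5000):
    """Fit w so that <features, w> approximates the alpha-quantile of scores.

    Hypothetical helper: full-batch subgradient descent on the pinball loss.
    """
    n, d = features.shape
    w = np.zeros(d)
    for _ in range(steps):
        pred = features @ w
        # Subgradient w.r.t. pred: -alpha where score > pred, (1 - alpha) otherwise.
        g = np.where(scores > pred, -alpha, 1.0 - alpha)
        w -= lr * (features.T @ g) / n
    return w

def predict_member(features, scores, w):
    """Flag an example as a member when its score exceeds the predicted quantile."""
    return scores > features @ w

# Synthetic nonmember data (an assumption for illustration only):
# the score depends linearly on a scalar input plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=2000)
features = np.stack([np.ones_like(x), x], axis=1)  # phi(x) = [1, x]
scores = 2.0 * x + rng.normal(0.0, 0.5, size=x.shape)

alpha = 0.9
w = fit_linear_quantile(features, scores, alpha)

# Fraction of nonmembers flagged as members; should be close to 1 - alpha.
fpr_in_sample = float(np.mean(predict_member(features, scores, w)))
```

Because the threshold is a function of the features rather than of the class label, the same `w` can be applied to examples from a class that never appeared in the fitting data. Whether the resulting false positive rate stays near `1 - alpha` on such a class is exactly the kind of transfer question the theorem addresses.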

Figures (14)

  • Figure 1: True positive rates for shadow model attacks in the 1% false positive rate regime for CINIC-10 and CIFAR-100 (we defer the 0.1% regime to the appendix). Each bar represents the TPR on the indicated class. "Full training" refers to the TPR on class $i$ when no classes are excluded from shadow model training. In yellow, we plot the TPR when that class is excluded from shadow model training. The attack success degrades significantly under class exclusion, often performing worse than the marginal baseline (global threshold).
  • Figure 2: True positive rates in the low false positive regime for CINIC-10 and CIFAR-100 (superclass set) on each unseen class. Each bar represents the true positive rate on class $i$ when class $i$ is dropped from the attack training set. We only report results at 1% FPR; the results at 0.1% FPR are not meaningful due to the small sample size of the validation set on a single class (1000 samples). While quantile regression attacks have only a small advantage over shadow models on CINIC-10 (see the Gaussian mixture model visualization in Figure 5), they achieve up to 11$\times$ higher TPR than shadow models on CIFAR-100.
  • Figure 3: ROC curves for class and sample drop experiments on ImageNet. Enlarged versions of the plots are provided in the appendix.
  • Figure 4: True positive rates in the low false positive regime for Texas (tabular) and 20 Newsgroups (text) on sets of unseen classes. Each bar represents the true positive rate on classes $C$ when $C$ are dropped from the attack training set. We only report results at 1% FPR; the results at 0.1% FPR are not meaningful due to the small sample size of the validation set on a single class. Quantile regression attacks achieve up to 2$\times$ higher TPR than shadow models on Texas and up to 6$\times$ higher TPR on 20 Newsgroups.
  • Figure 5: Visualization of Gaussian mixture models fit to the (dimension-reduced) embeddings learned by the quantile regression models trained on subsets of CINIC-10, CIFAR-100, and ImageNet. Dropping a class largely does not change the distribution over embeddings for CIFAR-100 and ImageNet, which are the datasets where we observe that quantile regression is the most effective.
  • ...and 9 more figures
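
The figure captions above report the true positive rate (TPR) at a fixed false positive rate (FPR), and compare against a global-threshold baseline. As a rough illustration of how such a number can be computed, here is a minimal sketch. The function name `tpr_at_fpr` and the toy scores are assumptions for illustration, not taken from the paper, and the sketch assumes higher scores indicate membership.

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """TPR and realized FPR of a single global threshold.

    The threshold is set at the (1 - target_fpr) quantile of the
    nonmember scores, so about target_fpr of nonmembers exceed it.
    """
    threshold = np.quantile(nonmember_scores, 1.0 - target_fpr)
    tpr = float(np.mean(member_scores > threshold))
    fpr = float(np.mean(nonmember_scores > threshold))
    return tpr, fpr

# Toy scores (illustrative only): members are shifted upward by 500.
nonmember_scores = np.arange(1000, dtype=float)
member_scores = np.arange(500, 1500, dtype=float)

tpr, fpr = tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01)
```

A statement such as "11× higher TPR" then corresponds to the ratio of two such TPR values measured at the same FPR. With only about 1000 validation samples in a class, a 0.1% FPR threshold is determined by roughly one nonmember example, which is consistent with the captions' note that results in that regime are not meaningful.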

Theorems & Definitions (4)

  • Definition 5.1: Pinball loss
  • Definition 5.2: Multi-Accuracy for Quantile Prediction
  • Theorem 5.3: Transferability of Quantile Predictors
  • proof