GLiRA: Black-Box Membership Inference Attack via Knowledge Distillation

Andrey V. Galichin, Mikhail Pautov, Alexey Zhavoronkin, Oleg Y. Rogov, Ivan Oseledets

TL;DR

This work addresses the privacy risks of training data in deep networks by developing GLiRA, a black-box membership inference attack guided by knowledge distillation. GLiRA trains shadow networks by distilling knowledge from the target model and then applies a Likelihood Ratio Attack in the offline setting, exploiting logit-level information to separate training members from non-members more accurately than prior methods. The authors compare two distillation losses, Kullback-Leibler (KL) divergence and Mean Squared Error (MSE), and show that the MSE variant generally achieves higher attack success at low false-positive rates (FPR) across multiple datasets and architectures, while GLiRA-KL performs strongly at higher FPR. The results suggest that distillation-based shadow modeling substantially strengthens membership inference in black-box scenarios, underscoring privacy risks in API-accessible models and motivating defenses that address logit-level leakage and distribution alignment.
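
To make the offline likelihood-ratio step concrete, here is a minimal sketch. It assumes the commonly used offline formulation, in which a Gaussian is fitted to the logit-scaled true-class confidences of shadow models that did not train on the query point, and the target model's confidence is scored with a one-sided test. The function names (`logit_scale`, `offline_lira_score`, `tpr_at_fpr`) are hypothetical and are not taken from the paper; the exact statistic used by GLiRA may differ.

```python
import math
from statistics import NormalDist, mean, stdev


def logit_scale(p, eps=1e-6):
    """Map a true-class probability p to log(p / (1 - p)), clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def offline_lira_score(target_conf, shadow_confs):
    """Score in [0, 1]; larger means the point looks more like a training member.

    target_conf  : true-class probability returned by the target model.
    shadow_confs : true-class probabilities from shadow models that did NOT
                   train on this point (assumed "out" distribution).
    """
    out = [logit_scale(p) for p in shadow_confs]
    mu, sigma = mean(out), max(stdev(out), 1e-6)
    # One-sided test: how unusually high is the target's confidence
    # compared with models that never saw the point?
    return NormalDist(mu, sigma).cdf(logit_scale(target_conf))


def tpr_at_fpr(member_scores, nonmember_scores, fpr):
    """True-positive rate at a threshold that flags at most `fpr` of non-members."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(fpr * len(ranked))  # non-members we may misclassify
    threshold = ranked[allowed] if allowed < len(ranked) else -math.inf
    hits = sum(s > threshold for s in member_scores)
    return hits / len(member_scores)
```

In this sketch the shadow models supply only the non-member distribution, which is what makes the attack "offline": no shadow model has to be retrained per query point. Reporting `tpr_at_fpr` at a small `fpr` reflects the low-FPR regime emphasized in the summary above.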

Abstract

While Deep Neural Networks (DNNs) have demonstrated remarkable performance in tasks related to perception and control, there are still several unresolved concerns regarding the privacy of their training data, particularly in the context of vulnerability to Membership Inference Attacks (MIAs). In this paper, we explore a connection between the susceptibility to membership inference attacks and the vulnerability to distillation-based functionality stealing attacks. In particular, we propose GLiRA, a distillation-guided approach to membership inference attacks on black-box neural networks. We observe that knowledge distillation significantly improves the efficiency of the likelihood ratio membership inference attack, especially in the black-box setting, i.e., when the architecture of the target model is unknown to the attacker. We evaluate the proposed method across multiple image classification datasets and models and demonstrate that likelihood ratio attacks, when guided by knowledge distillation, outperform the current state-of-the-art membership inference attacks in the black-box setting.

Paper Structure

This paper contains 29 sections, 19 equations, 9 figures, 2 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Histograms of the logits of the ground truth class for different architectures of the target model, CIFAR10 training dataset. We observe a notable difference between the histograms, which can lead to a decreased alignment between target and shadow models if their architectures differ.
  • Figure 2: Illustration of the proposed pipeline for shadow model training. We are given the target model $f^t$, which can be queried via an API. We sample a training dataset $D_\text{shadow}$ from the underlying training data distribution $\mathbb{D}$, and train the shadow model using the knowledge distillation procedure $\texttt{Distill}(D_\text{shadow}, f^t)$. The process is repeated $m$ times to obtain the final set of shadow models $\{f_1, f_2, \dots, f_m\}$. After that, an adversary can use the shadow models to determine the membership status of a given data point.
  • Figure 3: The effect of the balancing factor $\alpha$ and the temperature parameter $\tau$ from Eq. \ref{eq:knowledge_distillation_kl} on the success rate of the proposed attack methods. We present results at a fixed low FPR of $0.01\%$ and consider two experimental setups. Blue: the architecture of target and shadow models is the same (namely, MobileNet-V2). Orange: the architecture of the target model is MobileNet-V2; the architecture of shadow models is ResNet34.
  • Figure 4: The quantitative results of experiments. We compare the performance of different attack methods in the setting when the adversary is aware of the target model architecture and uses it to train shadow models. Results are presented for three different datasets and four model architectures (from top to bottom: MobileNet-V2, ResNet-34, VGG16, WideResNet28-10).
  • Figure 5: The quantitative results of experiments. We compare the performance of different attack methods in the setting when the adversary is unaware of the target model architecture and, hence, cannot use it to train shadow models. Results are presented for three different datasets and four target/shadow architecture pairs (from top to bottom: Target MobileNet-V2, Shadow ResNet-34; Target ResNet-34, Shadow VGG16; Target VGG16, Shadow WideResNet28-10; Target WideResNet28-10, Shadow MobileNet-V2).
  • ...and 4 more figures
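
The captions above refer to a balancing factor $\alpha$ and a temperature $\tau$ in the KL distillation loss, but the equation itself is not reproduced on this page. The sketch below therefore assumes the conventional formulation, in which a cross-entropy term on the label is weighted by $\alpha$ and a temperature-softened KL term is weighted by $(1 - \alpha)\tau^2$; the paper's exact weighting may differ. The MSE variant is assumed to compare student and target logits directly. All function names are hypothetical.

```python
import math


def log_softmax(logits, tau=1.0):
    """Numerically stable log-softmax of logits divided by temperature tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def kd_kl_loss(student_logits, target_logits, label, alpha=0.5, tau=2.0):
    """Assumed KL distillation loss: alpha * CE + (1 - alpha) * tau^2 * KL."""
    log_p_target = log_softmax(target_logits, tau)
    log_p_student = log_softmax(student_logits, tau)
    # KL(target || student) on temperature-softened distributions.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_target, log_p_student))
    # Standard cross-entropy of the student on the hard label (tau = 1).
    ce = -log_softmax(student_logits)[label]
    return alpha * ce + (1.0 - alpha) * tau ** 2 * kl


def kd_mse_loss(student_logits, target_logits):
    """Assumed MSE distillation loss: mean squared difference of raw logits."""
    n = len(student_logits)
    return sum((s - t) ** 2 for s, t in zip(student_logits, target_logits)) / n
```

Under these assumptions, `Distill(D_shadow, f^t)` from Figure 2 would amount to minimizing one of these losses over $D_\text{shadow}$, with `target_logits` obtained by querying $f^t$. Matching raw logits with MSE constrains the shadow model's output scale directly, whereas the KL term only matches the softened class distribution.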