Unveiling the Unseen: Exploring Whitebox Membership Inference through the Lens of Explainability

Chenxi Li; Abhinav Kumar; Zhen Guo; Jie Hou; Reza Tourani

TL;DR

This work tackles privacy risks from white-box MIAs by identifying a small subset of hidden activations that most strongly influence membership leakage. It introduces a neuron-selection pipeline guided by statistical tests and SHAP-based ensembling, plus an attack-driven explainable framework that links raw input features to MIA success via a cascaded target–attack model with forward hooks. The authors demonstrate improvements of up to 26.9% over prior white-box MIAs across multiple datasets and architectures, and quantify the overlap between features driving classification and membership inference using SHAP and SSIM analyses. The findings offer practical guidance for designing defenses that perturb high-impact raw features while preserving target task performance, leveraging interpretable insights into the privacy-attack mechanisms.
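To make the SSIM-based overlap measurement concrete, here is a minimal sketch of comparing two feature-importance maps. It is an illustration under assumptions, not the authors' procedure: the summary does not state the SSIM configuration used, so this version computes a single global SSIM over flattened maps (the standard metric averages over local windows), and `global_ssim`, `cls_map`, and `mia_map` are hypothetical names. The maps are assumed to hold importance values scaled to [0, 1], for example normalized absolute SHAP values.

```python
import statistics


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized, flattened maps.

    Simplification: statistics are taken over the whole map rather than
    over sliding local windows as in the usual image-quality metric.
    """
    if len(x) != len(y) or not x:
        raise ValueError("maps must be non-empty and of equal length")
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


# Hypothetical importance maps: one for the classification task,
# one for the membership inference task (values are made up).
cls_map = [0.0, 0.1, 0.9, 0.2]
mia_map = [0.9, 0.1, 0.0, 0.2]

same = global_ssim(cls_map, cls_map)   # identical maps
diff = global_ssim(cls_map, mia_map)   # importance placed on different features
print(same, diff)
```

A score near 1 would indicate that the two tasks rely on nearly the same raw features, while a low or negative score would indicate that they rely on different ones.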

Abstract

The increasing prominence of deep learning applications and their reliance on personalized data underscore the urgent need to address privacy vulnerabilities, particularly Membership Inference Attacks (MIAs). Despite numerous MIA studies, significant knowledge gaps persist, notably the impact of hidden features (in isolation) on attack efficacy and the insufficient justification of the root causes of attacks in terms of raw data features. In this paper, we address these gaps by first exploring statistical approaches to identify the most informative neurons and quantifying the significance of the hidden activations from the selected neurons on attack accuracy, both in isolation and in combination. Additionally, we propose an attack-driven explainable framework that integrates the target and attack models to identify the raw data features most influential in successful membership inference attacks. Our proposed MIA shows an improvement of up to 26% over state-of-the-art MIAs.
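The idea of statistically selecting informative neurons can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the abstract does not name the statistical tests used, so Welch's t-statistic stands in as one plausible choice, and `welch_t`, `select_neurons`, the 20% fraction, and the synthetic activations are all hypothetical.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    if se == 0.0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / se


def select_neurons(member_acts, nonmember_acts, fraction=0.2):
    """Return sorted indices of the top `fraction` of neurons ranked by |t|.

    Each argument is a list of activation vectors, one vector per sample.
    """
    n_neurons = len(member_acts[0])
    scores = []
    for j in range(n_neurons):
        m = [row[j] for row in member_acts]
        n = [row[j] for row in nonmember_acts]
        scores.append(abs(welch_t(m, n)))
    k = max(1, round(fraction * n_neurons))
    ranked = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Synthetic demo: only neurons 2 and 7 respond differently to members.
rng = random.Random(0)
N_NEURONS = 10
INFORMATIVE = {2, 7}


def fake_activations(shift):
    """Draw one activation vector; informative neurons get a mean shift."""
    return [rng.gauss(shift if j in INFORMATIVE else 0.0, 1.0) for j in range(N_NEURONS)]


members = [fake_activations(1.5) for _ in range(200)]
nonmembers = [fake_activations(0.0) for _ in range(200)]
selected = select_neurons(members, nonmembers, fraction=0.2)
print(selected)
```

In a real attack, the activation vectors would be read from the target model's hidden layers for known member and non-member samples, and only the selected columns would be passed to the attack model as features.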

Paper Structure (20 sections, 1 equation, 12 figures, 4 tables)


Introduction
Related Work
Threat Model and Security Assumptions
White-box Attack Methodology
Target Model Training and Membership Distribution Analysis
Membership Guided Neuron Selection
MIA Training and Ensemble
Explainable Membership Inference Methodology
Experiments
Datasets
Models' Specifications
Evaluation Metrics
Evaluation Results
Target Model Performance
Attack Performance
...and 5 more sections

Figures (12)

Figure 1: Performing PCA analysis on second-to-last layer's output from both member and non-member data samples using a ResNet architecture trained on the UTKFace dataset. The results reveal that selecting the top 20% of the most influential neurons results in a more pronounced separation between member and non-member samples compared to using all neurons.
Figure 2: Our proposed MIA framework analyzes membership distribution of data samples for selecting the most influential subset of neurons as attack features. For ensembling the final MIA model, it uses SHAP to select the most significant MIA models.
Figure 3: The proposed explainable MIA framework, in which the raw feature importance will be selected based on the target model (a), as well as the combination of the target and attack models (b), to identify the significant raw features that leak more information regarding sample membership.
Figure 4: Attack accuracy trend when using 100% of neurons from the last three layers.
Figure 5: MIA accuracy under different percentages of neurons of the last layer on two target models using five techniques across different datasets.
...and 7 more figures
