Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership Inference

Catherine Huang; Martin Pawelczyk; Himabindu Lakkaraju

TL;DR

This work tackles the privacy risks of post-hoc feature attribution explanations when fine-tuning foundation vision transformers. It introduces three membership inference attacks (VAR-LRT, L1-LRT, and L2-LRT) that exploit the variance and norms of explanations to identify training-set members with confidence at low false-positive rates, outperforming prior explanation-based methods and rivaling loss-based LiRA in some settings. The authors demonstrate that differentially private fine-tuning, using DP-SGD with automatic gradient norm clipping, can substantially mitigate these risks while preserving high accuracy. An extensive empirical study across five vision datasets and multiple ViT architectures substantiates both the effectiveness of the attacks and the protective effect of DP, underscoring the need for privacy-aware explainability in high-stakes applications. The results highlight a clear privacy-utility trade-off and point to future work on robust privacy defenses in explainable ML systems.
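To make the defense mentioned above more concrete, here is a minimal sketch of one DP-SGD update with automatic gradient norm clipping. It assumes the normalization form g / (||g|| + gamma) for each per-sample gradient, which may differ in detail from what the authors used; the function names, the `gamma` default, and the plain-list representation of gradients are hypothetical choices made for illustration. Translating the noise multiplier into a privacy budget $\epsilon$ requires a privacy accountant, which is not shown.

```python
import math
import random


def automatic_clip(grad, gamma=0.01):
    """Rescale one per-sample gradient to norm < 1 (assumed form: g / (||g|| + gamma))."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / (norm + gamma) for g in grad]


def dp_sgd_step(params, per_sample_grads, lr, noise_multiplier, rng, gamma=0.01):
    """One noisy update: normalize each gradient, sum, add Gaussian noise, average."""
    batch_size = len(per_sample_grads)
    clipped = [automatic_clip(g, gamma) for g in per_sample_grads]
    new_params = []
    for j, p in enumerate(params):
        total = sum(g[j] for g in clipped)
        # Each normalized gradient has norm < 1, so the sum has sensitivity
        # at most 1 and the noise standard deviation is noise_multiplier * 1.
        noisy = total + rng.gauss(0.0, noise_multiplier)
        new_params.append(p - lr * noisy / batch_size)
    return new_params


# Toy usage with made-up gradients for two examples and two parameters.
rng = random.Random(0)
updated = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [-1.0, 2.0]],
                      lr=0.1, noise_multiplier=1.0, rng=rng)
```

Unlike fixed-threshold clipping, this form has no clipping threshold to tune, because every per-sample gradient is rescaled to roughly unit norm regardless of its original magnitude.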

Abstract

Predictive machine learning models are increasingly deployed in high-stakes contexts involving sensitive personal data; in these contexts, there is a trade-off between model explainability and data privacy. In this work, we push the boundaries of this trade-off: with a focus on foundation models for image classification fine-tuning, we reveal unforeseen privacy risks of post-hoc model explanations and subsequently offer mitigation strategies for such risks. First, we construct VAR-LRT and L1/L2-LRT, two new membership inference attacks based on feature attribution explanations that are significantly more successful than existing explanation-leveraging attacks, particularly in the low false-positive rate regime that allows an adversary to identify specific training set members with confidence. Second, we find empirically that optimized differentially private fine-tuning substantially diminishes the success of the aforementioned attacks, while maintaining high model accuracy. We carry out a systematic empirical investigation of our 2 new attacks with 5 vision transformer architectures, 5 benchmark datasets, 4 state-of-the-art post-hoc explanation methods, and 4 privacy strength settings.
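The attacks named in the abstract can be illustrated with a short sketch. It assumes a LiRA-style procedure: each attribution vector is collapsed to a scalar (its variance for VAR-LRT, its L1 or L2 norm for L1-LRT and L2-LRT), Gaussians are fitted to that scalar across shadow models that did or did not train on the example, and the log-likelihood ratio is used as the membership score. The function names and the toy numbers are hypothetical, and the authors' actual procedure may differ in detail.

```python
import math
from statistics import NormalDist, fmean, pstdev, pvariance


def explanation_statistic(attribution, kind="var"):
    """Collapse a flattened attribution vector into one scalar."""
    if kind == "var":
        return pvariance(attribution)
    if kind == "l1":
        return sum(abs(a) for a in attribution)
    if kind == "l2":
        return math.sqrt(sum(a * a for a in attribution))
    raise ValueError(f"unknown statistic: {kind}")


def fit_gaussian(samples, min_sigma=1e-8):
    """Fit a Gaussian to shadow-model statistics; floor sigma to keep pdf defined."""
    return NormalDist(fmean(samples), max(pstdev(samples), min_sigma))


def lrt_score(target_stat, in_stats, out_stats):
    """Log-likelihood ratio of 'member' vs 'non-member'; larger means more member-like."""
    p_in = fit_gaussian(in_stats).pdf(target_stat)
    p_out = fit_gaussian(out_stats).pdf(target_stat)
    tiny = 1e-300  # guard against log(0) when a density underflows
    return math.log(max(p_in, tiny)) - math.log(max(p_out, tiny))


# Toy usage: made-up statistics from shadow models trained with (IN)
# and without (OUT) the example under attack.
in_stats = [0.9, 1.0, 1.1]
out_stats = [0.1, 0.2, 0.3]
score = lrt_score(1.0, in_stats, out_stats)
```

Thresholding `score` at different values traces out an ROC curve; the low false-positive-rate end of that curve is the regime the abstract emphasizes.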

Paper Structure (43 sections, 4 equations, 13 figures, 9 tables, 3 algorithms)

Introduction
Related Work
Preliminaries
Our Membership Inference Attack Methods on Model Explanations
Experimental Results
Setup
Evaluation of the VAR-LRT Attack
Evaluation of the L1-LRT and L2-LRT Attacks
Comparison with Loss-Based LiRA
Mitigating Attack Success with Differential Privacy
Discussion
Appendix
The Case for Foundation Models and Vision Transformers
The Vision Transformer Architecture
Post-Hoc Feature Attribution Explanations
...and 28 more sections

Figures (13)

Figure 1: VAR-LRT vs. baseline thresholding attack ROCs for the CIFAR-10 (left), CIFAR-100 (middle), and Food 101 (right) datasets. We present results for all explanation methods under each dataset's chosen model and hyperparameter setting.
Figure 2: L1-LRT and L2-LRT attack results for the CIFAR-10 (left), CIFAR-100 (middle), and Food 101 (right) datasets.
Figure 3: L1-LRT attack success of non-private (purple curves) vs. DP fine-tuned models (other curves). We show one plot per explanation method: IXG (left), SL (middle), and (with the exception of CIFAR-100) IG (right). Each subplot shows curves for $\epsilon = 0.5, 1.0, 2.0, 8.0$.
Figure 4: The GELU, ReLU, and ELU (Exponential Linear Unit) activation functions. The vision transformer architecture uses GELU activations.
Figure 5: Overview of the vision transformer (ViT) model architecture. ViT splits an image into patches, embeds them (linearly and positionally), and feeds the embeddings into a Transformer encoder.
...and 8 more figures
