Explaining the Model, Protecting Your Data: Revealing and Mitigating the Data Privacy Risks of Post-Hoc Model Explanations via Membership Inference
Catherine Huang, Martin Pawelczyk, Himabindu Lakkaraju
TL;DR
This work examines the privacy risks of post-hoc feature attribution explanations when fine-tuning foundation vision transformers. It introduces three membership inference attacks (VAR-LRT, L1-LRT, and L2-LRT) that exploit the variances and norms of explanations to identify training set members with confidence at low false-positive rates, outperforming prior explanation-based methods and rivaling loss-based LiRA in some settings. The authors show that differentially private fine-tuning, using DP-SGD with automatic gradient norm clipping, can substantially mitigate these risks while preserving high accuracy. An empirical study across five vision datasets and multiple ViT architectures supports both the effectiveness of the attacks and the protective effect of DP, underscoring the need for privacy-aware explainability in high-stakes applications. The results highlight a privacy-utility trade-off and motivate future work on robust privacy defenses for explainable ML systems.
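To make the defense mentioned above concrete, the following is a minimal sketch of one DP-SGD update with automatic clipping. It assumes that "automatic clipping" means rescaling each per-sample gradient by its norm plus a small stability constant, so that no clipping threshold needs tuning. The function name `dp_sgd_step_auto_clip`, the parameter `gamma`, and the NumPy setting are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

def dp_sgd_step_auto_clip(params, per_sample_grads, lr, noise_multiplier,
                          gamma=0.01, rng=None):
    """One illustrative DP-SGD step with normalization-style clipping.

    per_sample_grads has shape (batch_size, dim). gamma is an assumed
    stability constant; it keeps every rescaled gradient norm below 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    grads = np.asarray(per_sample_grads, dtype=float)
    batch_size, dim = grads.shape

    # Rescale each per-sample gradient so its norm is strictly below 1.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    normalized = grads / (norms + gamma)

    # Add Gaussian noise calibrated to a per-sample sensitivity of 1.
    noise = rng.normal(0.0, noise_multiplier, size=dim)
    noisy_mean = (normalized.sum(axis=0) + noise) / batch_size

    # Plain gradient descent on the privatized average gradient.
    return np.asarray(params, dtype=float) - lr * noisy_mean
```

In practice the noise multiplier would be chosen by a privacy accountant to meet a target privacy budget; that step is omitted here.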
Abstract
Predictive machine learning models are increasingly deployed in high-stakes contexts involving sensitive personal data; in these contexts, there is a trade-off between model explainability and data privacy. In this work, we push the boundaries of this trade-off: focusing on fine-tuning foundation models for image classification, we reveal unforeseen privacy risks of post-hoc model explanations and then offer strategies to mitigate them. First, we construct VAR-LRT and L1/L2-LRT, two new membership inference attacks based on feature attribution explanations that are significantly more successful than existing explanation-leveraging attacks, particularly in the low false-positive rate regime that allows an adversary to identify specific training set members with confidence. Second, we find empirically that optimized differentially private fine-tuning substantially diminishes the success of these attacks while maintaining high model accuracy. We carry out a systematic empirical investigation of our two new attacks across five vision transformer architectures, five benchmark datasets, four state-of-the-art post-hoc explanation methods, and four privacy strength settings.
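As a rough illustration of how an attack of this kind could score a candidate example, the sketch below summarizes a feature attribution map by its variance, L1 norm, and L2 norm, then compares the chosen statistic against two Gaussians fitted to reference values. It assumes the reference values come from shadow models trained with and without the example, and that a Gaussian fit is adequate; both are assumptions of this sketch, and the helper names `attribution_stats` and `lrt_score` are hypothetical rather than the authors' API.

```python
import math
import numpy as np

def attribution_stats(attribution):
    """Summarize an attribution map by its variance, L1 norm, and L2 norm."""
    a = np.asarray(attribution, dtype=float).ravel()
    return {
        "var": float(a.var()),
        "l1": float(np.abs(a).sum()),
        "l2": float(np.sqrt((a ** 2).sum())),
    }

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def lrt_score(target_stat, in_stats, out_stats, eps=1e-12):
    """Log-likelihood ratio: positive values favor 'training set member'.

    in_stats / out_stats are statistics observed when the example was /
    was not in the training set (assumed to come from shadow models).
    """
    mu_in, sd_in = float(np.mean(in_stats)), float(np.std(in_stats)) + eps
    mu_out, sd_out = float(np.mean(out_stats)), float(np.std(out_stats)) + eps
    return (gaussian_logpdf(target_stat, mu_in, sd_in)
            - gaussian_logpdf(target_stat, mu_out, sd_out))
```

An adversary would threshold this score, choosing the threshold to hit a target false-positive rate. Which direction each statistic moves for members versus non-members is an empirical question that this sketch leaves to the fitted reference distributions.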
