Towards a Game-theoretic Understanding of Explanation-based Membership Inference Attacks

Kavita Kumari; Murtuza Jadliwala; Sumit Kumar Jha; Anindya Maiti

TL;DR

The work advances privacy analysis for explainable AI by modeling explanation-based membership inference attacks as a two-player, continuous-time stochastic signaling game in which explanation variance evolves as a Geometric Brownian Motion. It proves the theoretical existence of an optimal explanation-variance threshold and characterizes a unique Markov Perfect Equilibrium under pooling, demonstrating how an attacker can exploit historical explanation variance to infer membership. Through extensive experiments on five datasets and multiple gradient-based explanation methods, the paper shows that attack feasibility hinges on explanation technique, input dimensionality, model capacity, and training iterations, with CIFAR-10 particularly challenging due to high variance. The results provide a principled framework for understanding privacy risks in explainable ML and offer guidance for defenses that adjust explanation variance or detection thresholds to mitigate MIA risk.
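To make the Geometric Brownian Motion assumption concrete, the following minimal sketch samples one GBM path and finds the first time it crosses a fixed threshold. It is an illustration only, not the paper's simulation code: the function names, the drift `mu`, the volatility `sigma`, the starting value, and the threshold are all hypothetical choices, and the path is drawn from the standard closed-form log-normal solution of the GBM equation.

```python
import numpy as np


def simulate_gbm(x0, mu, sigma, horizon, n_steps, seed=0):
    """Sample one GBM path on [0, horizon] from the exact log-normal solution.

    All parameter values passed in are illustrative assumptions,
    not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    # Brownian increments are independent N(0, dt) draws.
    dw = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    # Increments of log X_t: (mu - sigma^2 / 2) dt + sigma dW_t.
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * dw
    log_path = np.log(x0) + np.concatenate(([0.0], np.cumsum(log_increments)))
    return np.exp(log_path)


def first_crossing(path, threshold):
    """Return the first index where path >= threshold, or None if it never does."""
    hits = np.nonzero(path >= threshold)[0]
    return int(hits[0]) if hits.size else None


# Hypothetical explanation-variance trajectory and stopping threshold.
path = simulate_gbm(x0=0.1, mu=0.05, sigma=0.3, horizon=1.0, n_steps=250)
stop_index = first_crossing(path, threshold=0.12)
print("stopped at step:", stop_index)
```

A fixed threshold is used here only for simplicity; in the paper the threshold is the outcome of the equilibrium analysis rather than a preset constant.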

Abstract

Model explanations improve the transparency of black-box machine learning (ML) models and their decisions; however, they can also be exploited to carry out privacy threats such as membership inference attacks (MIA). Existing works have analyzed MIA only in a single "what if" interaction scenario between an adversary and the target ML model; thus, they do not discern the factors that affect an adversary's ability to launch MIA in repeated interaction settings. Additionally, these works rely on assumptions about the adversary's knowledge of the target model's structure and, thus, do not guarantee the optimality of the predefined threshold required to distinguish members from non-members. In this paper, we study explanation-based threshold attacks, in which the adversary attempts MIA by leveraging the variance of explanations through iterative interactions with the system, which comprises the target ML model and its corresponding explanation method. We model such interactions with a continuous-time stochastic signaling game framework. In our framework, an adversary plays a stopping game, interacting with the system (which has imperfect information about the adversary's type, i.e., honest or malicious) to obtain explanation variance information and compute an optimal threshold that accurately determines the membership of a datapoint. First, we propose a sound mathematical formulation to prove that such an optimal threshold exists and can be used to launch MIA. Then, we characterize the conditions under which a unique Markov perfect equilibrium (or steady state) exists in this dynamic system. Through a comprehensive set of simulations of the proposed game model, we assess different factors that can affect an adversary's ability to launch MIA in such repeated interaction settings.
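The threshold attack described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the explanation vectors are hand-made toy data, the function names are hypothetical, and the decision rule "variance at or below the threshold means member" is an assumed direction that the abstract does not specify.

```python
import numpy as np


def explanation_variance(explanations):
    """Variance of each explanation vector (one row per queried datapoint)."""
    return np.var(np.asarray(explanations, dtype=float), axis=1)


def threshold_attack(explanations, threshold):
    """Predict membership from explanation variance.

    Assumption: a datapoint is labeled a member when its explanation
    variance is at or below the threshold.
    """
    return explanation_variance(explanations) <= threshold


def true_positive_rate(predicted, is_member):
    """Fraction of actual members that the attack labels as members."""
    predicted = np.asarray(predicted, dtype=bool)
    is_member = np.asarray(is_member, dtype=bool)
    return float(predicted[is_member].mean())


# Toy attributions: the first row is flat, the second row varies strongly.
toy_explanations = [
    [0.1, 0.1, 0.1, 0.1],
    [0.0, 1.0, 0.0, 1.0],
]
toy_membership = [True, False]  # hypothetical ground truth

predictions = threshold_attack(toy_explanations, threshold=0.1)
print(predictions.tolist(), true_positive_rate(predictions, toy_membership))
```

In the paper's setting the threshold is not chosen by hand as it is here; the game model is what establishes that an optimal threshold exists.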

Paper Structure (28 sections, 7 theorems, 103 equations, 3 figures, 2 tables)


Introduction
Background and Preliminaries
Machine Learning
Gradient based Explanations
Membership Inference Attacks
Geometric Brownian Motion
Optimal Control and the Stopping Problem
Game Model
Intuition
Setup and Assumptions
Equilibrium Description
Equilibrium Analysis
Value Functions
Analytical Results
Experimental Setup
...and 13 more sections

Key Result

Lemma 1

There exists a positive upper bound $u_{th}$ on the variance of an explanation generated by an explanation method representing the maximum variance value that can be reached for the query sent by the end-user.

Figures (3)

Figure 1: Illustration of a continuous path analysis of $U(\pi)$ and $L(\pi)$ in Markov Perfect Equilibrium
Figure 2: Functional paths for the different datasets. Panels (a), (c), (e), (g), and (i) show the optimal functional paths for the end-user; panels (b), (d), (f), (h), and (j) show the optimal functional paths for the system.
Figure 3: Accuracy (TPR) of the optimal strategy obtained by the system and the end-user: (a) Gradient*Input method and (b) other explanation methods.

Theorems & Definitions (14)

Lemma 1
Lemma 2
Theorem 1
Lemma 3
Lemma 4
Theorem 2
Theorem 3
proof
proof
proof
...and 4 more
