Watermarking Counterfactual Explanations

Hangzhi Guo, Firdaus Ahmed Choudhury, Tinghua Chen, Amulya Yadav

TL;DR

This work proposes CFMark, a novel model-agnostic watermarking framework for detecting unauthorized model extraction attacks relying on CF explanations, and establishes a critical foundation for the secure deployment of CF explanations in real-world applications.

Abstract

Counterfactual (CF) explanations for ML model predictions provide actionable recourse recommendations to individuals adversely impacted by predicted outcomes. However, despite being preferred by end-users, CF explanations have been shown to pose significant security risks in real-world applications; in particular, malicious adversaries can exploit CF explanations to perform query-efficient model extraction attacks on the underlying proprietary ML model. To address this security challenge, we propose CFMark, a novel model-agnostic watermarking framework for detecting unauthorized model extraction attacks relying on CF explanations. CFMark involves a novel bi-level optimization problem to embed an indistinguishable watermark into the generated CF explanation such that any future model extraction attacks using these watermarked CF explanations can be detected using a null hypothesis significance testing (NHST) scheme. At the same time, the embedded watermark does not compromise the quality of the CF explanations. We evaluate CFMark across diverse real-world datasets, CF explanation methods, and model extraction techniques. Our empirical results demonstrate CFMark's effectiveness, achieving an F-1 score of ~0.89 in identifying unauthorized model extraction attacks using watermarked CF explanations. Importantly, this watermarking incurs only a negligible degradation in the quality of generated CF explanations (i.e., ~1.3% degradation in validity and ~1.6% in proximity). Our work establishes a critical foundation for the secure deployment of CF explanations in real-world applications.
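The detection step described above can be illustrated with a small hypothesis test. The following is a minimal sketch, not the paper's procedure: it assumes, for illustration only, that a model trained on watermarked CF explanations assigns higher posterior probability to watermarked explanations than to their unwatermarked counterparts, and it uses a paired sign-flip permutation test as a stand-in for whatever NHST scheme CFMark actually employs. The function names, the significance level `alpha`, and the resample count are all hypothetical.

```python
import random


def paired_sign_flip_pvalue(p_watermarked, p_plain, n_resamples=10_000, seed=0):
    """One-sided paired permutation test (illustrative stand-in, not CFMark's test).

    H0: the suspicious model treats watermarked and unwatermarked CF
        explanations alike (paired differences are symmetric around 0).
    H1: it assigns higher posterior probability to the watermarked ones.
    """
    if len(p_watermarked) != len(p_plain) or not p_plain:
        raise ValueError("inputs must be non-empty and of equal length")

    diffs = [w - u for w, u in zip(p_watermarked, p_plain)]
    n = len(diffs)
    observed = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under H0 each paired difference is equally likely to have either sign.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if resampled >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


def flag_extraction(p_watermarked, p_plain, alpha=0.05):
    """Flag the suspicious model when H0 is rejected at level alpha (assumed)."""
    return paired_sign_flip_pvalue(p_watermarked, p_plain) < alpha
```

Here `p_watermarked` and `p_plain` would hold the suspicious model's posterior probabilities on paired watermarked and unwatermarked CF explanations. A permutation test is used only because it needs no external libraries; a paired t-test would be a natural alternative under the same assumptions.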

Paper Structure (17 sections, 3 theorems, 11 equations, 9 figures, 5 tables, 1 algorithm)

Key Result

Proposition 1

Suppose $p_x$ is the posterior probability of $x$ predicted by the suspicious model. Let $x^\text{cf}$ and $\hat{x}^{\text{cf}}$ denote the unwatermarked and watermarked counterfactual explanations, respectively. Let $\bar{p}_{x^{\text{cf}}}$ and $\bar{p}_{\hat{x}^{\text{cf}}}$ each denote the mean of the

Figures (9)

  • Figure 1: Illustration of CFMark. (a) ① Model extraction attack. The adversaries use querying data $D^x$ and their CF explanations $D^\text{cf}(\boldsymbol{\theta})$ to train a private model $f_w$ that reproduces the predictive behavior of the proprietary model $F_W$. ② Watermarking. Given CF explanations $x^\text{cf}$ and the proprietary model $F_W$, CFMark embeds a watermark into the CF explanations: $\hat{x}^\text{cf} = G_\theta(x^\text{cf})$. (b) Model ownership verification. If the adversaries use $D^\text{cf}(\boldsymbol{\theta})$ to train an extracted model $f_w$, our framework can identify this unauthorized usage via hypothesis testing.
  • Figure 2: Illustration of the evaluation procedure for watermarking. Green indicates positive cases, and red indicates negative cases.
  • Figure 3: The impact of the regularization term on the credit dataset when using DiCE.
  • Figure 4: The impact of using training data $D^t$ in CFMark when evaluated on the credit dataset.
  • Figure 5: Robustness of CFMark to backdoor removal.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Proposition 1
  • Theorem 4.1
  • Theorem A.1
  • proof