Model extraction from counterfactual explanations

Ulrich Aïvodji, Alexandre Bolot, Sébastien Gambs

TL;DR

This work reveals a security/privacy risk in post-hoc explanations: counterfactual explanations can be exploited to extract a target black-box model with high fidelity and accuracy using surprisingly few queries. The authors formalize explanation-based, fidelity-focused attacks, describe multiple adversarial scenarios, and demonstrate strong extraction performance on real-world datasets, even under partial knowledge of data distributions or unknown architectures. They show that providing multiple, diverse counterfactuals further boosts attack effectiveness, highlighting a tension between explanation realism and privacy. The results motivate developing privacy-preserving, inherently transparent models or restricted explanation interfaces to mitigate such leakage while maintaining beneficial interpretability.

Abstract

Post-hoc explanation techniques refer to a posteriori methods that can be used to explain how black-box machine learning models produce their outcomes. Among post-hoc explanation techniques, counterfactual explanations are becoming one of the most popular methods to achieve this objective. In particular, in addition to highlighting the most important features used by the black-box model, they provide users with actionable explanations in the form of data instances that would have received a different outcome. Nonetheless, by doing so, they also leak non-trivial information about the model itself, which raises privacy issues. In this work, we demonstrate how an adversary can leverage the information provided by counterfactual explanations to build high-fidelity and high-accuracy model extraction attacks. More precisely, our attack enables the adversary to build a faithful copy of a target model by accessing its counterfactual explanations. The empirical evaluation of the proposed attack on black-box models trained on real-world datasets demonstrates that it achieves high-fidelity and high-accuracy extraction even under low query budgets.

Paper Structure

This paper contains 14 sections, 4 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of a counterfactual explanation scenario. Given an original instance for which the model predicts the loan denied class, a counterfactual explanation framework provides different instances that are close to the original one but belong to the desired class (loan approved here). An individual asking for an explanation can thus see which aspects of their profile they may try to change to obtain the desired outcome.
  • Figure 2: Illustration of a traditional model extraction attack and an explanation-based model extraction attack. In the former, the adversary relies on the predictions $\mathcal{B}(x_1),\ldots,\mathcal{B}(x_n)$ of the target model $\mathcal{B}$ to build the surrogate model $S_\mathcal{A}$ using a process $\psi(\cdot)$, whereas in the latter, the adversary combines the predictions $\mathcal{B}(x_1),\ldots,\mathcal{B}(x_n)$ and the explanations $\mathcal{E}(x_1),\ldots,\mathcal{E}(x_n)$ of the target model $\mathcal{B}$ to generate the surrogate $S_\mathcal{A}$ using another process $\psi'(\cdot)$.
  • Figure 3: Decision boundary of the surrogate model on the Adult Income dataset [frank2010uci].
  • Figure 4: Performance (i.e., fidelity) of the model extraction attack in scenario (S4) on the Adult Income dataset. The results show the impact of the number of counterfactual explanations per query on the fidelity of the extraction attack.
  • Figure 5: Performance (i.e., fidelity) of the model extraction attack in scenario (S5) on the Adult Income dataset. The results show the impact of the proximity and diversity metrics on the fidelity of the surrogate.
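
Fidelity, the metric reported in Figures 4 and 5, is commonly defined in the model extraction literature as the fraction of evaluation points on which the surrogate agrees with the target model. The formula below is a sketch under that assumption; the evaluation set $X_{\text{eval}}$ and the name $\mathrm{Fid}$ are notation introduced here for illustration and may differ from the paper's own.

$$
\mathrm{Fid}(S_\mathcal{A}) = \frac{1}{|X_{\text{eval}}|} \sum_{x \in X_{\text{eval}}} \mathbb{1}\left[ S_\mathcal{A}(x) = \mathcal{B}(x) \right]
$$

Under this reading, a fidelity of 1 means the surrogate reproduces every decision of $\mathcal{B}$ on $X_{\text{eval}}$, whether or not those decisions are correct with respect to the true labels.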

Theorems & Definitions (1)

  • Definition 1: Explanation-based model extraction
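
The difference between the two attacks in Figure 2 can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the linear target `target_predict`, the brute-force explainer `counterfactual`, the logistic-regression surrogate, and every parameter value are hypothetical stand-ins chosen so that the example runs with the Python standard library alone. The baseline surrogate is trained on the queries and their predictions only, while the explanation-based surrogate is also trained on the counterfactuals, each labelled with the opposite class.

```python
import math
import random


def target_predict(x):
    """Hypothetical black-box B: a fixed linear rule on two features."""
    return 1 if 2.0 * x[0] - 1.0 * x[1] + 0.5 > 0 else 0


def counterfactual(x, predict, step=0.05, max_steps=100, n_dirs=16):
    """Hypothetical explainer E: search rings of growing radius around x
    and return the first point whose predicted class differs from x's."""
    original = predict(x)
    for k in range(1, max_steps + 1):
        r = k * step
        for j in range(n_dirs):
            a = 2.0 * math.pi * j / n_dirs
            c = (x[0] + r * math.cos(a), x[1] + r * math.sin(a))
            if predict(c) != original:
                return c
    raise ValueError("no counterfactual found within the search radius")


def train_surrogate(points, labels, epochs=200, lr=0.5):
    """Fit a 2-feature logistic regression with plain SGD; return a predictor."""
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            z = w0 * x[0] + w1 * x[1] + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            g = 1.0 / (1.0 + math.exp(-z)) - y
            w0 -= lr * g * x[0]
            w1 -= lr * g * x[1]
            b -= lr * g
    return lambda x: 1 if w0 * x[0] + w1 * x[1] + b > 0 else 0


def fidelity(surrogate, target, points):
    """Fraction of points on which the surrogate and the target agree."""
    return sum(1 for x in points if surrogate(x) == target(x)) / len(points)


def run_attack(n_queries=20, seed=0):
    rng = random.Random(seed)
    queries = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_queries)]
    labels = [target_predict(x) for x in queries]
    # One counterfactual per query; by construction it has the opposite class.
    cfs = [counterfactual(x, target_predict) for x in queries]
    cf_labels = [1 - y for y in labels]

    baseline = train_surrogate(queries, labels)
    with_expl = train_surrogate(queries + cfs, labels + cf_labels)

    eval_points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(2000)]
    return {
        "pairs": list(zip(queries, cfs)),
        "n_train_with_explanations": len(queries) + len(cfs),
        "fidelity_baseline": fidelity(baseline, target_predict, eval_points),
        "fidelity_with_explanations": fidelity(with_expl, target_predict, eval_points),
    }


if __name__ == "__main__":
    result = run_attack()
    print("baseline fidelity:         ", result["fidelity_baseline"])
    print("explanation-based fidelity:", result["fidelity_with_explanations"])
```

Each counterfactual lies just across the decision boundary from its query, so every query yields a pair of oppositely labelled points that straddle the boundary. That is the intuition for why explanations can help an adversary at the same query budget. The sketch only illustrates the mechanism, and the numbers it prints for this toy setup say nothing about the paper's reported results.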