AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against Interpretable Models

Abdullah Caglar Oksuz, Anisa Halimi, Erman Ayday

TL;DR

This work addresses the risk that explainable AI tools can facilitate model extraction attacks on interpretable models under black-box access. It introduces AUTOLYCUS, a retraining-based, explanation-guided attack that uses LIME and SHAP signals to generate informative queries and build surrogate models with limited queries. Across six dataset scenarios, AUTOLYCUS achieves high surrogate fidelity with substantially fewer queries than state-of-the-art baselines and demonstrates transferability to multiple interpretable architectures, while countermeasures such as differential privacy and explanation perturbations prove largely ineffective. The study highlights a practical vulnerability in MLaaS deployments and calls for defenses that consider explanation-driven leakage, with implications for the design of secure, privacy-preserving explainability tools.

Abstract

Explainable Artificial Intelligence (XAI) aims to uncover the decision-making processes of AI models. However, the data used for such explanations can pose security and privacy risks. Existing literature identifies attacks on machine learning models, including membership inference, model inversion, and model extraction attacks. These attacks target either the model or the training data, depending on the settings and parties involved. XAI tools can increase a model's vulnerability to model extraction attacks, which is a concern when model owners prefer black-box access and keep model parameters and architecture private. To exploit this risk, we propose AUTOLYCUS, a novel retraining (learning) based model extraction attack framework against interpretable models under black-box settings. As XAI tools, we exploit Local Interpretable Model-Agnostic Explanations (LIME) and Shapley values (SHAP) to infer decision boundaries and create surrogate models that replicate the functionality of the target model. LIME and SHAP are chosen mainly for their realistic yet information-rich explanations, coupled with their extensive adoption, simplicity, and usability. We evaluate AUTOLYCUS on six machine learning datasets, measuring the accuracy of the surrogate model and its similarity to the target model. The results show that AUTOLYCUS is highly effective, requiring significantly fewer queries than state-of-the-art attacks while maintaining comparable accuracy and similarity. We validate its performance and transferability on multiple interpretable ML models, including decision trees, logistic regression, naive Bayes, and k-nearest neighbors. Additionally, we show the resilience of AUTOLYCUS against proposed countermeasures.
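
The retraining idea and the similarity measure mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `target_model`, the 1-nearest-neighbor surrogate, and the uniformly random queries are hypothetical stand-ins, and the sketch deliberately omits the explanation-guided query generation that distinguishes AUTOLYCUS. It only shows a surrogate being trained on query/label pairs obtained from a black box, and similarity being computed as the fraction of points on which the two models agree.

```python
import random

def target_model(x):
    """Hypothetical black-box target: a small decision tree over two features."""
    if x[0] < 0.5:
        return 0
    return 1 if x[1] < 0.3 else 2

def fit_1nn(samples, labels):
    """Train a 1-nearest-neighbor surrogate on the labelled queries."""
    def predict(x):
        nearest = min(
            range(len(samples)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(samples[i], x)),
        )
        return labels[nearest]
    return predict

def similarity(model_a, model_b, points):
    """Fraction of points on which the two models predict the same class."""
    agree = sum(model_a(p) == model_b(p) for p in points)
    return agree / len(points)

rng = random.Random(0)

# The attacker only sees labels returned by the black box (no parameters).
queries = [(rng.random(), rng.random()) for _ in range(200)]
labels = [target_model(q) for q in queries]
surrogate = fit_1nn(queries, labels)

# Similarity is measured on fresh points that were not used as queries.
test_points = [(rng.random(), rng.random()) for _ in range(500)]
sim = similarity(target_model, surrogate, test_points)
print(f"similarity with {len(queries)} random queries: {sim:.2f}")
```

In this toy setting, raising or lowering the number of queries changes the similarity, which is the trade-off the paper's query-budget experiments examine; an explanation-guided attack aims to reach a given similarity with fewer queries than random sampling.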

Paper Structure

This paper contains 15 sections, 8 figures, 3 tables, and 1 algorithm.

Figures (8)

  • Figure 1: LIME explanation example from the Crop dataset
  • Figure 2: SHAP explanation example from the Crop dataset
  • Figure 3: AUTOLYCUS system diagram consisting of the following steps: (1) a user sends a query to the MLaaS platform, (2) the MLaaS platform verifies the validity of the query so that no empty or incomplete queries are processed, (3) the ML model $M$ predicts the class $y_i$ of the queried sample and the explainer computes its explanation $E_i$, (4) the MLaaS platform returns the results to the user, and (5) an adversarial user exploits the explanations via the TRAV-A algorithm (as described in Section \ref{sec:generation}) to extract the decision boundaries of the target model $M$.
  • Figure 4: Impact of the number of queries ($Q$) on surrogate model similarity when LIME is used. $(k=3, n=1)$
  • Figure 5: Impact of the number of queries ($Q$) on surrogate model similarity when SHAP is used. $(k=3, n=5)$
  • ...and 3 more figures
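
Steps (1) to (4) of the Figure 3 caption can be sketched as a small service function. This is a hedged illustration only: the linear model `WEIGHTS`, the `BASELINE` feature means, and the function names are hypothetical, and the additive per-feature attribution used for $E_i$ is a simple stand-in for what a LIME or SHAP explainer would return. Step (5), the TRAV-A algorithm, is not described in this section and is not reproduced here.

```python
from typing import List, Optional, Sequence, Tuple

WEIGHTS = [0.8, -0.5, 0.3]   # hypothetical linear target model M
BASELINE = [0.5, 0.5, 0.5]   # hypothetical feature means used by the explainer

def is_valid(query: Sequence[Optional[float]]) -> bool:
    """Step 2: reject empty or incomplete queries."""
    return len(query) == len(WEIGHTS) and all(v is not None for v in query)

def explain(query: Sequence[float]) -> List[float]:
    """Step 3 (explainer): per-feature contribution to the score, E_i."""
    return [w * (x - b) for w, x, b in zip(WEIGHTS, query, BASELINE)]

def predict(query: Sequence[float]) -> int:
    """Step 3 (model): predicted class y_i from the sign of the score."""
    return 1 if sum(explain(query)) > 0 else 0

def handle_query(query: Sequence[Optional[float]]) -> Tuple[int, List[float]]:
    """Steps 1-4: receive a query, validate it, and return (y_i, E_i)."""
    if not is_valid(query):
        raise ValueError("empty or incomplete query")
    return predict(query), explain(query)

# A user (benign or adversarial) sees only the prediction and its explanation.
y, e = handle_query([0.9, 0.2, 0.5])
print("prediction:", y)
print("explanation:", [round(v, 3) for v in e])
```

The point of the sketch is the interface: every answered query hands the caller both a label and a per-feature signal, and it is this second output that an explanation-guided attack exploits.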