Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning

Hubert Baniecki, Przemyslaw Biecek

TL;DR

This work critically examines claims that intrinsically interpretable models are robust and reliably interpretable. It introduces adversarial prototype manipulation and backdoor attacks on prototype-based networks (notably ProtoViT and PIP-Net) and discusses potential defenses via concept bottleneck models. Across bird species recognition and medical-imaging tasks, the study shows that high accuracy can coexist with superficial, misleading explanations and vulnerable reasoning, challenging the presumed safety of interpretable architectures. The findings highlight substantial gaps in robustness and interpretability, urging the development of secure, aligned interpretable models for high-stakes applications.

Abstract

A common belief is that intrinsically interpretable deep learning models ensure a correct, intuitive understanding of their behavior and offer greater robustness against accidental errors or intentional manipulation. However, these beliefs have not been comprehensively verified, and growing evidence casts doubt on them. In this paper, we highlight the risks related to overreliance and susceptibility to adversarial manipulation of these so-called "intrinsically (aka inherently) interpretable" models by design. We introduce two strategies for adversarial analysis with prototype manipulation and backdoor attacks against prototype-based networks, and discuss how concept bottleneck models defend against these attacks. Fooling the model's reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks, leading to a false sense of security reinforced by a visual confirmation bias. The reported limitations of part-prototype networks put their trustworthiness and applicability into question, motivating further work on the robustness and alignment of (deep) interpretable models.
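
To make the abstract's point about latent prototypes concrete, here is a toy sketch. It is not the authors' attack or any model from the paper. It assumes a generic prototype head that scores classes from the maximum cosine similarity between patch latents and prototype vectors. All names are hypothetical, including `prototype_activations`, `predict`, `explain` and the two galleries. The sketch only shows that the prediction depends on latent vectors, while the picture shown for each prototype is a separate lookup. If an adversary arranges for car patches to occupy the prototypes' latent positions, the displayed explanation changes and the prediction does not.

```python
import numpy as np

def prototype_activations(patch_latents, prototypes):
    """Max cosine similarity of each prototype over all image patches."""
    z = patch_latents / np.linalg.norm(patch_latents, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (z @ p.T).max(axis=0)  # shape: (n_prototypes,)

def predict(patch_latents, prototypes, class_weights):
    """Linear class scores on top of prototype activations (assumed head)."""
    acts = prototype_activations(patch_latents, prototypes)
    logits = class_weights @ acts
    return int(np.argmax(logits)), acts

def explain(acts, gallery, top_k=2):
    """Return the gallery entries of the top-k most activated prototypes."""
    top = np.argsort(acts)[::-1][:top_k]
    return [gallery[i] for i in top]

# Hypothetical setup: 4 orthogonal prototypes in an 8-dimensional latent space.
prototypes = np.eye(4, 8)
# Prototypes 0 and 1 vote for class 0; prototypes 2 and 3 vote for class 1.
class_weights = np.array([[1.0, 1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 1.0]])

# Latents of two patches from one test image, close to prototypes 0 and 1.
patch_latents = np.zeros((2, 8))
patch_latents[0, 0], patch_latents[0, 4] = 1.0, 0.1
patch_latents[1, 1], patch_latents[1, 5] = 1.0, 0.2

# Two candidate visualizations for the SAME latent prototypes (illustrative).
bird_gallery = ["bird_wing", "bird_beak", "bird_tail", "bird_eye"]
car_gallery = ["car_wheel", "car_door", "car_hood", "car_mirror"]

label, acts = predict(patch_latents, prototypes, class_weights)
print(label, explain(acts, bird_gallery), explain(acts, car_gallery))
```

Nothing in `predict` reads either gallery, so the model's reasoning cannot be checked from the displayed prototype images alone.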

Paper Structure

This paper contains 36 sections, 5 equations, 14 figures, 2 tables, 2 algorithms.

Figures (14)

  • Figure 1: Birds look like cars. Interpretation of image recognition by a ProtoViT classifying bird species based on car prototypes with 85% accuracy. We emphasize that what looks alike to humans is not representative of what is similar according to the model. Below is a schema of the prototype-based model's architecture with a highlighted point of failure. Manipulating the model's reasoning by exploiting its use of latent prototypes ($\theta_g$) manifests the inherent uninterpretability of prototype-based networks, which may be masked by human overreliance due to visual confirmation bias.
  • Figure 2: Backdoor attacks on a PIP-Net model classifying malignant skin lesions with 85% accuracy. The naive backdoor attack could be detected by a visual inspection of explanations, where one prototype highlights the trigger. Below is a schema of the prototype-based model's architecture with a highlighted point of failure. We exploit the vulnerabilities of latent-based models ($\theta_f$), considering two adversarial scenarios: disguising the attack when the model provides an original explanation for a new prediction, and a red herring that further manipulates the explanation, naturally covering up the reason why the prediction changed from benign to malignant.
  • Figure 3: Prototype substitution
  • Figure 4: We experiment with medical imaging as an example of a high-stakes application that requires machine learning interpretability and security. (A) Exemplary biases found in medical images used for skin lesion diagnosis (top) and bone abnormality detection (bottom). These are learned by (interpretable) deep learning models, often influencing their predictions (cf. spurious correlations, shortcut learning). (B) Exemplary triggers that can be used to embed a backdoor into (interpretable) deep learning models.
  • Figure 5: Backdoor attack
  • ...and 9 more figures