Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?

Sonia Laguna, Ričards Marcinkevičs, Moritz Vandenhirtz, Julia E. Vogt

TL;DR

Intervenability is formalised as a measure of the effectiveness of concept-based interventions and is used to fine-tune black boxes. The proposed fine-tuning improves intervention effectiveness and often yields better-calibrated predictions.

Abstract

Recently, interpretable machine learning has re-explored concept bottleneck models (CBMs). An advantage of this model class is the user's ability to intervene on predicted concept values, affecting the downstream output. In this work, we introduce a method to perform such concept-based interventions on pretrained neural networks, which are not interpretable by design, given only a small validation set with concept labels. Furthermore, we formalise the notion of intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black boxes. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We focus on backbone architectures of varying complexity, from simple, fully connected neural nets to Stable Diffusion. We demonstrate that the proposed fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of our techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes are more intervenable than CBMs. Lastly, we establish that our methods are still effective under vision-language-model-based concept annotations, alleviating the need for a human-annotated validation set.

Paper Structure

This paper contains 55 sections, 7 equations, 20 figures, 3 tables, and 3 algorithms.

Figures (20)

  • Figure 1: A simplified, intuitive example: an image of a grizzly bear is wrongly identified as an otter. Our method allows the user to perform a concept-based intervention and flip the predicted class. In order of appearance from left to right and top to bottom, the depicted concepts and classes are "fierce", "timid", "muscle", "walks", "otter", and "grizzly bear".
  • Figure 2: Three steps of the intervention procedure. (i) A probe $q_{\boldsymbol{\xi}}$ is trained to predict the concepts ${\bm{c}}$ from the activation vector ${\bm{z}}$. (ii) The representations are edited according to Equation \ref{eq:inter_cf}. (iii) The final prediction is updated to $\hat{y}'$ based on the edited representations ${\bm{z}}'$.
  • Figure 3: Intervention results w.r.t. target AUROC on the synthetic bottleneck data. We explore the performance under varying validation set sizes ($N_{\textrm{val}}$). Percentages correspond to the fractions of the original validation set. For CBMs, we report the results obtained by training on the validation (CBM val) and full training sets (CBM full). Interventions were performed on test data across ten simulations. Lines correspond to medians, and confidence bands are given by interquartile ranges.
  • Figure 4: Intervention results on the (a) synthetic incomplete, (b) AwA2, (c) CIFAR-10, and (d) MIMIC-CXR datasets w.r.t. target AUROC (top) and AUPR (bottom) across ten seeds.
  • Figure A.1: Schematic summary of concept-based instance-specific interventions on a black-box neural network. This work introduces an intervention procedure that, given concept values ${\bm{c}}'$ for an input ${\bm{x}}$, edits the network's activation vector ${\bm{z}}$ at an intermediate layer, replacing it with ${\bm{z}}'$ coherent with the given concepts. The intervention results in an updated prediction $\hat{y}'$.
  • ...and 15 more figures
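
The three steps in the Figure 2 caption can be illustrated with a small toy sketch. It is not the authors' implementation: the editing equation referenced in the caption is not reproduced in this excerpt, so the objective used here (probe loss on the target concepts plus a squared-distance penalty, weighted by a hypothetical `lam`) is an assumption. The synthetic activations, the linear-logistic probe, and the fixed `head` function are likewise hypothetical stand-ins for a real network's intermediate layer and output head.

```python
import math
import random

def sigmoid(a):
    # Numerically safe logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def probe(z, W, b):
    """q(z): concept probabilities predicted from an activation vector z."""
    return [sigmoid(sum(w_i * z_i for w_i, z_i in zip(w, z)) + b_j)
            for w, b_j in zip(W, b)]

def fit_probe(Z, C, lr=1.0, steps=300):
    """Step (i): fit a linear-logistic probe by gradient descent on mean BCE."""
    n, d, k = len(Z), len(Z[0]), len(C[0])
    W = [[0.0] * d for _ in range(k)]
    b = [0.0] * k
    for _ in range(steps):
        gW = [[0.0] * d for _ in range(k)]
        gb = [0.0] * k
        for z, c in zip(Z, C):
            p = probe(z, W, b)
            for j in range(k):
                err = (p[j] - c[j]) / n  # d(mean BCE) / d(logit_j)
                gb[j] += err
                for i in range(d):
                    gW[j][i] += err * z[i]
        for j in range(k):
            b[j] -= lr * gb[j]
            for i in range(d):
                W[j][i] -= lr * gW[j][i]
    return W, b

def intervene(z, c_new, W, b, lam=5.0, lr=0.05, steps=500):
    """Step (ii), assumed objective: lam * BCE(q(z'), c_new) + ||z' - z||^2."""
    z_new = list(z)
    for _ in range(steps):
        p = probe(z_new, W, b)
        for i in range(len(z)):
            grad = lam * sum((p[j] - c_new[j]) * W[j][i] for j in range(len(W)))
            grad += 2.0 * (z_new[i] - z[i])  # stay close to the original z
            z_new[i] -= lr * grad
    return z_new

def head(z):
    """Hypothetical frozen output head g(z); depends on the first coordinate."""
    return sigmoid(3.0 * z[0])

# Toy "validation set": activations Z with noisy binary concept labels C.
rng = random.Random(0)
d, k, n = 4, 2, 300
Z = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
C = [[1.0 if z[j] + 0.5 * rng.gauss(0, 1) > 0 else 0.0 for j in range(k)]
     for z in Z]
W, b = fit_probe(Z, C)

# One instance whose first concept is predicted "off"; the user sets it "on"
# and leaves the remaining concepts at the probe's current hard predictions.
z = [-1.0, 0.8, 0.3, -0.2]
p_before = probe(z, W, b)
c_new = [1.0] + [float(round(p)) for p in p_before[1:]]
z_edit = intervene(z, c_new, W, b)

# Step (iii): recompute the prediction from the edited representation.
p_after = probe(z_edit, W, b)
y_before, y_after = head(z), head(z_edit)
print(f"concept 0: {p_before[0]:.2f} -> {p_after[0]:.2f}")
print(f"prediction: {y_before:.2f} -> {y_after:.2f}")
```

In this sketch, `lam` trades off agreement with the user-supplied concepts against staying close to the original activations. A larger value moves the representation further and changes the prediction more strongly.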