Post-hoc Concept Bottleneck Models

Mert Yuksekgonul, Maggie Wang, James Zou

TL;DR

Post-hoc Concept Bottleneck Models address the interpretability-performance gap of concept bottlenecks by enabling post-hoc conversion of any pretrained model into a concept bottleneck using a fixed concept subspace learned via CAVs or multimodal descriptions, with a residual variant (PCBM-h) to recover accuracy. Concepts can be sourced from other datasets or language descriptions (e.g., CLIP, ConceptNet) to form richer bottlenecks, reducing annotation needs. The framework supports global edits by manipulating concept weights, and human-in-the-loop pruning demonstrates substantial gains under distribution shift without target-domain data. Overall, PCBMs achieve competitive accuracy across diverse tasks while offering interpretable, editable bottlenecks and practical data-efficiency advantages.

Abstract

Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts ("the bottleneck") and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck Models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBMs can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBMs is that they enable users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBMs allow for global model edits, which can be more efficient than previous works on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining.

Paper Structure

This paper contains 16 sections, 2 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: Post-hoc Concept Bottleneck Models. First, we learn the vectors in our concept bank. With the CAV approach, for each concept (e.g., stripes), we train a linear SVM to distinguish the embeddings of examples that contain the concept from those that do not, and use the vector normal to the decision boundary as the concept activation vector (CAV). When annotations are hard to obtain, we can leverage multimodal models and use the text encoder to map each concept to a vector. Next, we project the embeddings produced by the backbone onto the concept subspace defined by the set of vectors. We then train an interpretable predictor to classify the examples from their projections. When the concept library is incomplete, we can construct a PCBM-h by sequentially introducing a residual predictor that maps the embeddings to the target space.
  • Figure 2: Explaining Post-hoc CBMs. We report the three largest weights in the linear layer for the shown classes. For instance, Blue Whitish Veils, Atypical Pigment Networks, and Irregular Streaks have large weights for classifying whether a skin lesion is malignant. These are consistent with dermatologists' findings (Menzies et al., 1996).
  • Figure 3: User Study Interface. We train PCBMs on MetaShift scenarios, each with a distribution shift between the training and test datasets. The user selects concepts to prune from the model.
  • Figure 4: Residual component intervenes mostly when the confidence is low. Here, we look at the consistency between PCBM and PCBM-h predictions (i.e. whether both models make the same prediction). Namely, at each confidence level for the PCBM, we report the accuracy and consistency with the PCBM-h predictions. Overall, we see that PCBM-h is most likely to change the model prediction when the model is making a mistake, and otherwise, predictions are consistent.
  • Figure 5: Residual component intervenes only to fix mistakes. Here we show the number of mistakes, the number of predictions changed by PCBM-h, and the number of mistakes fixed by PCBM-h. We see that PCBM-h only changes the model predictions to fix model mistakes.
  • ...and 4 more figures
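
The pipeline described in the Figure 1 caption, together with the concept pruning mentioned in Figure 3, can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`learn_cav`, `ridge_fit`), the toy embeddings, and the concept names are hypothetical; the linear SVM is a small hand-rolled subgradient solver; and ridge least squares on one-hot labels stands in for whatever interpretable predictor the paper actually trains.

```python
import numpy as np

def learn_cav(pos, neg, epochs=300, lr=0.1, reg=1e-2):
    """Fit a linear SVM by subgradient descent on the hinge loss and
    return the unit vector normal to its decision boundary (the CAV)."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1  # examples violating the margin
        hinge_grad = (y[active][:, None] * X[active]).sum(axis=0) / len(X)
        w -= lr * (reg * w - hinge_grad)
        b -= lr * (-y[active].sum() / len(X))
    return w / np.linalg.norm(w)

def ridge_fit(A, Y, lam=1e-3):
    """Ridge least squares with a bias column (assumed stand-in predictor)."""
    A1 = np.hstack([A, np.ones((len(A), 1))])
    W = np.linalg.solve(A1.T @ A1 + lam * np.eye(A1.shape[1]), A1.T @ Y)
    return W[:-1], W[-1]  # weights, bias

# Synthetic "backbone embeddings": each concept shifts one coordinate axis.
rng = np.random.default_rng(0)
d, n = 6, 60

def noise():
    return 0.3 * rng.standard_normal((n, d))

def shift(axis):
    return 3.0 * np.eye(d)[axis]

# 1) Concept bank: one CAV per concept, from positive vs. negative examples.
concepts = ["stripes", "water"]  # hypothetical concept names
bank = np.stack([learn_cav(noise() + shift(i), noise()) for i in range(2)])

# 2) Project embeddings onto the concept subspace (CAVs are unit norm,
#    so the projection coefficient is a plain dot product).
X = np.vstack([noise() + shift(0), noise() + shift(1)])
labels = np.repeat([0, 1], n)
Y = np.eye(2)[labels]
Z = X @ bank.T

# 3) Interpretable linear predictor on the concept projections.
W, b = ridge_fit(Z, Y)
pcbm_scores = Z @ W + b
pcbm_acc = (pcbm_scores.argmax(axis=1) == labels).mean()

# 4) PCBM-h: with the interpretable predictor held fixed, fit a residual
#    predictor on the raw embeddings to what the concepts leave unexplained.
R, r_b = ridge_fit(X, Y - pcbm_scores)
pcbmh_scores = pcbm_scores + X @ R + r_b

# 5) Global edit: prune a concept by zeroing its weights, with no retraining.
W_pruned = W.copy()
W_pruned[concepts.index("water")] = 0.0
pruned_scores = Z @ W_pruned + b
```

Because the predictor is linear in the concept projections, each row of `W` can be read as the contribution of one concept to every class, which is what makes inspecting the largest weights and pruning a single concept a global edit rather than a per-example fix.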