Promises and Pitfalls of Black-Box Concept Learning Models

Anita Mahinpei, Justin Clark, Isaac Lage, Finale Doshi-Velez, Weiwei Pan

TL;DR

The paper shows that soft concept representations in Concept Bottleneck Models and Concept Whitening leak information beyond the predefined concepts, which undermines their interpretability. It systematically analyzes leakage under sequential training, latent-capacity augmentation, and decorrelation, with experiments on MNIST parity and synthetic datasets demonstrating that downstream predictors exploit the leaked information. The study finds that existing mitigation strategies do not fully prevent entanglement and that even hardened concepts can still reveal the data distribution. It concludes with proposals for mitigating leakage via mutual-information-aware training, concept refinement guided by domain experts, and cautious interpretation of concept-based explanations.

Abstract

Machine learning models that incorporate concept learning as an intermediate step in their decision-making process can match the performance of black-box predictive models while retaining the ability to explain outcomes in human-understandable terms. However, we demonstrate that the concept representations learned by these models encode information beyond the pre-defined concepts, and that natural mitigation strategies do not fully work, rendering the interpretation of the downstream prediction misleading. We describe the mechanism underlying the information leakage and suggest recourse for mitigating its effects.
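
The leakage described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the paper's experiment: the one-dimensional input, the concept "x > 0", the task "|x| > 0.5", and the sigmoid slope are all hypothetical choices made for illustration. The task label is statistically independent of the binary concept, so the hard concept alone cannot predict it better than chance, yet the soft activation is a monotone function of x and lets a downstream predictor recover the task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one input feature, one binary concept, one task label.
x = rng.uniform(-1.0, 1.0, size=2000)
concept = (x > 0).astype(int)            # predefined concept: "x is positive"
task = (np.abs(x) > 0.5).astype(int)     # task label, independent of the concept

# Soft concept activation: sigmoid of an assumed concept-model logit (slope * x).
slope = 4.0
soft = 1.0 / (1.0 + np.exp(-slope * x))

# Best predictor that sees only the hard concept: majority task label per concept value.
hard_pred = np.empty_like(task)
for c in (0, 1):
    mask = concept == c
    hard_pred[mask] = int(task[mask].mean() >= 0.5)
hard_acc = (hard_pred == task).mean()

# A downstream predictor that sees the soft activation can invert the sigmoid,
# recover x, and therefore recover the task label.
x_hat = np.log(soft / (1.0 - soft)) / slope
soft_pred = (np.abs(x_hat) > 0.5).astype(int)
soft_acc = (soft_pred == task).mean()

print(f"accuracy from hard concept: {hard_acc:.3f}")
print(f"accuracy from soft concept: {soft_acc:.3f}")
```

In this sketch the hard concept gives roughly chance-level accuracy while the soft activation gives near-perfect accuracy, so an explanation of the form "the prediction depends on the concept" would be misleading: the predictor is using the leaked position of x, not the concept itself.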

Paper Structure

This paper contains 12 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Top: The top two PCA dimensions of a curated subset of MNIST. Bottom: Activations for the two concepts "is 4" and "is 5" for each observation. Colors in both panels indicate each observation's location along the first PCA dimension. Note that the "is 4" concept preserves the ordering of the colors and thus encodes the first PCA dimension (Pearson correlation of -0.72). The "is 5" concept also helps preserve the first PCA dimension (Pearson correlation of -0.67). Together, these soft concept representations encode the data distribution well enough to allow good task performance, even though both concepts are independent of the task.
  • Figure 2: A synthetic dataset of two features, three concepts, and a somewhat complex task boundary that cuts across concepts.
  • Figure 3: The same dataset as Figure \ref{fig:features_concepts_task} but with colors indicating task labels. This is the dataset prior to the transformations present in Figures \ref{fig:concept_preds_to_task} and \ref{fig:sigmoided}.
  • Figure 4: Concept activations output from the features-to-concepts model. Because the three concepts are mutually exclusive, only two concept activations are independent. Hence, the three concept activations can be projected onto a plane without loss of information. Left: One angle showing the dataset. Right: A second angle showing the data lying on a plane.
  • Figure 5: The first two PCA dimensions of the concept activations, along with the synthetic task boundary and learned task boundary in the same PCA space. The task boundary and dataset are only modestly transformed from the original dataset seen in Figure \ref{fig:features_task}, and the learned task boundary appears similar to that learned directly from the features (Figure \ref{fig:features_to_task}). The concept predictions $\to$ task model achieves 95.9% accuracy.
  • ...and 8 more figures
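
The Figure 4 caption's observation, that activations for three mutually exclusive concepts lie on a plane, can be checked numerically. The sketch below assumes (the caption does not say) that the features-to-concepts model ends in a softmax over the three concepts; the random logits are a hypothetical stand-in for a trained model's outputs. Because each row of softmax activations sums to 1, the centered activations are orthogonal to the direction (1, 1, 1), so the third singular value should be numerically zero.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical logits for 500 observations and three mutually exclusive concepts.
logits = rng.normal(size=(500, 3))
acts = softmax(logits)

# Every row sums to 1, so only two of the three activations are free.
row_sums = acts.sum(axis=1)

# Singular values of the centered activations: the third is ~0 when the
# points lie on a two-dimensional plane.
centered = acts - acts.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

print("max deviation of row sums from 1:", np.abs(row_sums - 1.0).max())
print("singular values:", singular_values)
```

Under this assumption, projecting onto the first two principal directions (as in Figure 5) discards nothing, which is consistent with the caption's claim that the projection is lossless.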