Promises and Pitfalls of Black-Box Concept Learning Models
Anita Mahinpei, Justin Clark, Isaac Lage, Finale Doshi-Velez, Weiwei Pan
TL;DR
Prompted by the promise of interpretable decisions, the paper shows that the soft concept representations used in Concept Bottleneck Models and Concept Whitening leak information beyond the predefined concepts, which undermines interpretability. It systematically analyzes leakage under sequential training, latent-capacity augmentation, and decorrelation, with experiments on MNIST parity and synthetic datasets demonstrating that downstream predictors exploit the leaked information. The study finds that existing mitigation strategies do not fully prevent entanglement, and that even hardened concepts can still inadvertently reveal information about the data distribution. It concludes with proposals for mitigating leakage through mutual-information-aware training, concept refinement guided by domain experts, and cautious interpretation of concept-based explanations.
Abstract
Machine learning models that incorporate concept learning as an intermediate step in their decision-making process can match the performance of black-box predictive models while retaining the ability to explain outcomes in human-understandable terms. However, we demonstrate that the concept representations learned by these models encode information beyond the pre-defined concepts, and that natural mitigation strategies do not fully work, rendering the interpretation of the downstream prediction misleading. We describe the mechanism underlying the information leakage and suggest recourse for mitigating its effects.
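To make the notion of leakage concrete, the following is a minimal toy sketch, not the paper's experimental setup. It assumes a single binary concept, a hidden binary attribute that is independent of the concept, and a task label equal to the hidden attribute. The soft scores in `SOFT_SCORE` are hypothetical values chosen so that the confidence of the concept predictor encodes the hidden attribute; all names are illustrative.

```python
from itertools import product

# Hypothetical soft scores from a concept predictor, keyed by (concept, hidden).
# Assumption for illustration: the score is more confident when hidden == 1.
SOFT_SCORE = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.1, (0, 0): 0.3}


def make_rows():
    """Enumerate every (concept, hidden) pair once; the label is the hidden bit."""
    rows = []
    for concept, hidden in product([0, 1], repeat=2):
        soft = SOFT_SCORE[(concept, hidden)]
        rows.append({
            "concept": concept,
            "soft": soft,
            "hard": int(soft > 0.5),  # thresholded ("hardened") concept
            "label": hidden,
        })
    return rows


def accuracy(rows, predict, key):
    """Fraction of rows where predict(row[key]) equals the label."""
    hits = sum(predict(r[key]) == r["label"] for r in rows)
    return hits / len(rows)


def predict_from_soft(soft):
    """Downstream rule that reads the label off the score's confidence."""
    return int(abs(soft - 0.5) > 0.3)


def best_hard_accuracy(rows):
    """Best accuracy over all four functions from the hard concept to a label."""
    best = 0.0
    for out0, out1 in product([0, 1], repeat=2):
        acc = accuracy(rows, lambda hard: (out0, out1)[hard], "hard")
        best = max(best, acc)
    return best


rows = make_rows()
soft_acc = accuracy(rows, predict_from_soft, "soft")
hard_acc = best_hard_accuracy(rows)
print(f"accuracy from soft concept: {soft_acc}")
print(f"best accuracy from hard concept: {hard_acc}")
```

In this toy case the thresholded concept matches the true concept exactly, yet a downstream predictor given the soft score recovers a label that the concept itself says nothing about. Thresholding removes the leak here only because there is a single concept and the label is independent of it; as noted above, hardening does not eliminate leakage in general.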
