Concept Bottleneck Models Without Predefined Concepts
Simon Schrodi, Julian Schur, Max Argus, Thomas Brox
TL;DR
This work addresses interpretability in concept bottleneck models by removing the need for predefined or human-annotated concepts. It introduces Unsupervised Concept Bottleneck Models (UCBMs), which learn a compact concept space via dictionary learning on a pretrained model's activations and pair it with an interpretable, sparse classifier whose input-dependent gating mechanism restricts each prediction to a small subset of concepts shared across all classes. The approach achieves competitive performance while using far fewer concepts per input (e.g., ~0.7% of concepts on ImageNet) and yields decisions that can be explained at the concept level; it also demonstrates how large vision-language models can guide weight edits to fix misclassifications. Overall, UCBMs offer a scalable, interpretable alternative to black-box models and open the door to model editing guided by external multimodal models.
Abstract
There has been considerable recent interest in interpretable concept-based models such as Concept Bottleneck Models (CBMs), which first predict human-interpretable concepts and then map them to output classes. To reduce reliance on human-annotated concepts, recent works have converted pretrained black-box models into interpretable CBMs post-hoc. However, these approaches predefine a set of concepts, assuming which concepts a black-box model encodes in its representations. In this work, we eliminate this assumption by leveraging unsupervised concept discovery to automatically extract concepts without human annotations or a predefined set of concepts. We further introduce an input-dependent concept selection mechanism that ensures only a small subset of concepts is used across all classes. We show that our approach improves downstream performance and narrows the performance gap to black-box models, while using significantly fewer concepts in the classification. Finally, we demonstrate how large vision-language models can intervene on the final model weights to correct model errors.
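The pipeline described above (concept activations, input-dependent concept selection, then a linear classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the concept vectors have already been discovered, scores concepts by cosine similarity, and uses a simple top-k rule as a stand-in for the selection mechanism. All names (`concept_bank`, `top_k_gate`, `classify`) and the toy numbers are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def concept_activations(embedding, concept_bank):
    """Score each (assumed, pre-discovered) concept against the embedding."""
    return [cosine_similarity(embedding, concept) for concept in concept_bank]

def top_k_gate(activations, k):
    """Keep the k largest activations and zero out the rest.

    Assumption: top-k is only an illustrative selection rule; the gate is
    applied once per input, so every class sees the same few concepts.
    """
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]

def classify(gated, weights, biases):
    """Linear classifier on the gated concept activations (one row per class)."""
    return [sum(w * g for w, g in zip(row, gated)) + b
            for row, b in zip(weights, biases)]

# Toy setup: 4 hypothetical concepts in a 3-d embedding space, 2 classes.
concept_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
weights = [[1.0, 0.0, 0.0, 0.5],   # class 0
           [0.0, 1.0, 1.0, 0.0]]   # class 1
biases = [0.0, 0.0]

embedding = [0.9, 0.1, 0.0]        # stand-in for a pretrained model's activation
acts = concept_activations(embedding, concept_bank)
gated = top_k_gate(acts, k=2)
logits = classify(gated, weights, biases)
predicted = max(range(len(logits)), key=lambda i: logits[i])

# Each nonzero term w * g is one concept's contribution to a class logit,
# which is what makes the decision explainable at the concept level.
print("active concepts:", [i for i, g in enumerate(gated) if g != 0.0])
print("predicted class:", predicted)
```

Because the gate is applied before the classifier, the number of concepts entering a decision is bounded by `k` regardless of how many classes there are. Under the same assumptions, a weight edit of the kind mentioned in the abstract would amount to changing individual entries of `weights`.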
