Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Sukrut Rao, Sweta Mahajan, Moritz Böhle, Bernt Schiele

TL;DR

DN-CBM inverts traditional concept bottleneck models by first discovering concepts learned by CLIP with sparse autoencoders, then naming these concepts via CLIP text embeddings, and finally using the named concepts as a fixed bottleneck for downstream classification. The approach requires no task-specific concept labels and demonstrates task-agnostic, interpretable concept discovery that generalizes across datasets such as ImageNet, Places365, and CIFAR. Empirical results show semantically meaningful concepts, coherent naming, and competitive classification performance, with interpretable explanations at both the local and global levels and effective concept-based interventions. The work emphasizes scalability and robustness of interpretable CBMs and points to future improvements in expanding the concept space and vocabulary to capture finer-grained, less spurious concepts.

Abstract

Concept Bottleneck Models (CBMs) have recently been proposed to address the 'black-box' problem of deep neural networks, by first mapping images to a human-understandable concept space and then linearly combining concepts for classification. Such models typically require first coming up with a set of concepts relevant to the task and then aligning the representations of a feature extractor to map to these concepts. However, even with powerful foundational feature extractors like CLIP, there are no guarantees that the specified concepts are detectable. In this work, we leverage recent advances in mechanistic interpretability and propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm: instead of pre-selecting concepts based on the downstream classification task, we use sparse autoencoders to first discover concepts learnt by the model, and then name them and train linear probes for classification. Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model. We perform a comprehensive evaluation across multiple datasets and CLIP architectures and show that our method yields semantically meaningful concepts, assigns appropriate names to them that make them easy to interpret, and yields performant and interpretable CBMs. Code available at https://github.com/neuroexplicit-saar/discover-then-name.

Paper Structure

This paper contains 29 sections, 7 equations, 21 figures, 6 tables.

Figures (21)

  • Figure 1: Automated concept extraction and naming to construct task-agnostic concept bottlenecks. Our approach consists of three steps: (1) we use a sparse autoencoder to extract disentangled concepts from CLIP feature extractors, (2) automatically name extracted concepts by matching the dictionary vectors with the closest text embedding in CLIP space from a concept set of texts, and (3) use this named concept extractor layer as a concept bottleneck to create concept bottleneck models for classification on different datasets. In the example shown, the concepts 'colorful', 'spheres', and 'fence' are extracted from the image with high strengths, resulting in a prediction of 'ball pit'. For details, see \ref{fig:method} and \ref{sec:method}.
  • Figure 2: Overview. Our approach consists of three steps. (1) We train a sparse autoencoder to extract disentangled concepts from a CLIP vision backbone. The autoencoder is trained on a large dataset $\mathcal{D}_{extract}$ to reconstruct CLIP features using a linear combination of encoded concepts, which are optimized to be sparse using an $L_1$ sparsity penalty. The weights of the decoder can be interpreted as dictionary vectors whose linear sum with concept strengths reconstructs the original feature (\ref{sec:method:extract}). (2) We use a large concept set of texts $\mathcal{V}$ to name each extracted concept, by finding the text from the set whose embedding has the highest cosine similarity to the concept's dictionary vector (\ref{sec:method:name}). (3) We use the extracted and named concepts as a concept bottleneck layer, and train linear classifiers to construct inherently interpretable concept bottleneck models across downstream datasets $\mathcal{D}_{classify}$ using the same bottleneck layer (\ref{sec:method:cbm}).
  • Figure 3: Task-agnosticity of concept extraction. We show examples of named concepts (blocks) and top images activating them from four datasets (rows). We find that the images activating the concept are highly consistent with the concept name across datasets (e.g. the 'asleep' concept yields images across different species), despite not using these datasets for extraction and naming, showing the robustness of our approach.
  • Figure 4: User study on concept accuracy. Left: We evaluate the semantic consistency of concepts for nodes with high, intermediate, and low alignment with the text embeddings of the name assigned to them, both for nodes from our SAE (green) and the CLIP features (orange). We find that the concepts from the SAE are significantly more semantically consistent than CLIP features, and the consistency increases with alignment. The poor performance of the 'low alignment' group suggests that some nodes do not correspond to a consistent human-interpretable concept. Right: We plot the scores for semantic consistency against name accuracy from human evaluators, both for nodes from our SAE (green) and the CLIP features (orange). We find that compared to the baseline, our SAE nodes are generally more consistent and accurately named.
  • Figure 5: Impact of vocabulary. We show examples of pairs of concepts that, despite being assigned to the same coarse-grained name (e.g. left: 'tree'), correspond to distinct fine-grained concepts. Better names that can distinguish such concepts are assigned if added to the vocabulary (e.g. 'christmas tree' for the first concept, and 'tree in field' for the second). On the other hand, removing the assigned name from the vocabulary leads to worse names being assigned (e.g. 'ornaments' and 'branches'), which shows that the granularity of the vocabulary can impact name accuracy.
  • ...and 16 more figures
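
The three steps described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: random arrays stand in for CLIP image and text embeddings, the function names (`sae_forward`, `sae_loss`, `name_concepts`, `predict`), the ReLU encoder, the dimensions, and the `l1_coeff` value are all assumptions made for illustration, and no training loop is included (the sketch only evaluates the objective once).

```python
import numpy as np

rng = np.random.default_rng(0)

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Step 1: encode features into concept strengths, then reconstruct them."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU (assumed) keeps strengths >= 0
    x_hat = z @ W_dec + b_dec               # linear sum of dictionary vectors
    return z, x_hat

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on the concept strengths."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity

def name_concepts(W_dec, text_emb, vocab):
    """Step 2: name each concept by the text with the highest cosine similarity."""
    d = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = d @ t.T                          # shape: (n_concepts, n_vocab)
    return [vocab[i] for i in sims.argmax(axis=1)]

def predict(z, W_cls, b_cls):
    """Step 3: linear classifier on top of the fixed concept bottleneck."""
    return (z @ W_cls + b_cls).argmax(axis=1)

# Toy stand-ins (hypothetical sizes): 8-dim "CLIP" features, 16 concepts, 3 classes.
dim, n_concepts, n_classes = 8, 16, 3
vocab = ["colorful", "spheres", "fence", "tree", "asleep"]
text_emb = rng.normal(size=(len(vocab), dim))   # placeholder text embeddings

W_enc = 0.1 * rng.normal(size=(dim, n_concepts))
b_enc = np.zeros(n_concepts)
W_dec = rng.normal(size=(n_concepts, dim))      # rows act as dictionary vectors
b_dec = np.zeros(dim)

# Make two dictionary vectors parallel to known texts so their names are predictable.
W_dec[0] = 3.0 * text_emb[2]
W_dec[1] = 0.5 * text_emb[4]

x = rng.normal(size=(4, dim))                   # placeholder image features
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z)
names = name_concepts(W_dec, text_emb, vocab)

W_cls = rng.normal(size=(n_concepts, n_classes))
preds = predict(z, W_cls, np.zeros(n_classes))

print(names[:2], preds.shape)
```

Because the classifier is linear in the concept strengths, the product `z[i] * W_cls[:, c]` gives one contribution per named concept to class `c` for image `i`, which is one way to read off a local explanation in a bottleneck of this kind.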