CLIP-Free, Label-Free, Zero-Shot Concept Bottleneck Models
Fawaz Sammani, Jonas Fischer, Nikos Deligiannis
TL;DR
This work removes the reliance of Concept Bottleneck Models on CLIP by introducing TextUnlock, a CLIP-free, label-free method that converts any frozen visual classifier into a zero-shot CBM. A trainable MLP maps the classifier's visual features into a text-embedding space, and training aligns the classifier's output distribution with its vision–language counterpart so that the original predictions are preserved. The approach enables zero-shot concept discovery and concept-to-class mapping using a large, general concept bank, and extends to zero-shot image captioning with prefix-tuned language generation, achieving state-of-the-art results on ImageNet and several domain datasets. Its data-efficient, architecture-agnostic design preserves the classifier's reasoning while delivering interpretable concept activations and concept sets that can be changed on the fly.
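To make the alignment step concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a two-layer MLP, dot-product similarity against class-name text embeddings, and a soft-target cross-entropy loss; the function names (`mlp`, `alignment_loss`), the layer sizes, and the absence of a temperature are all illustrative assumptions. The frozen classifier supplies the target distribution, so no ground-truth labels appear anywhere.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mlp(features, w1, b1, w2, b2):
    # Hypothetical two-layer MLP: visual features -> text-embedding space.
    hidden = np.maximum(features @ w1 + b1, 0.0)  # ReLU
    return hidden @ w2 + b2

def alignment_loss(features, original_logits, class_text_emb, params):
    # Vision-language logits: similarity between projected features and
    # the text embeddings of the class names.
    projected = mlp(features, *params)            # (batch, text_dim)
    vl_logits = projected @ class_text_emb.T      # (batch, num_classes)
    # Target: the frozen classifier's own distribution (no labels needed).
    target = np.exp(log_softmax(original_logits))
    # Soft-target cross-entropy (an assumed choice of alignment loss).
    return float(-(target * log_softmax(vl_logits)).sum(axis=1).mean())
```

In this sketch only the MLP parameters would be trained; the classifier and the text embeddings stay fixed, which is what lets the original predictions act as the supervision signal.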
Abstract
Concept Bottleneck Models (CBMs) map dense, high-dimensional feature representations into a set of human-interpretable concepts, which are then combined linearly to make a prediction. However, modern CBMs rely on the CLIP model to establish a mapping from dense feature representations to textual concepts, and it remains unclear how to design CBMs for models other than CLIP. Methods that do not use CLIP instead require manual, labor-intensive annotation to associate feature representations with concepts. Furthermore, all CBMs necessitate training a linear classifier to map the extracted concepts to class labels. In this work, we lift all three limitations simultaneously by proposing a method that converts any frozen visual classifier into a CBM without requiring image-concept labels (label-free), without relying on the CLIP model (CLIP-free), and by deriving the linear classifier in a zero-shot manner. Our method is formulated by aligning the original classifier's distribution (over discrete class indices) with its corresponding vision-language counterpart distribution derived from textual class names, while preserving the classifier's performance. The approach requires no ground-truth image-class annotations, is highly data-efficient, and preserves the classifier's reasoning process. Applied and tested on over 40 visual classifiers, our resulting CLIP-free, zero-shot CBM sets a new state of the art, surpassing even supervised CLIP-based CBMs. Finally, we show that our method can be used for zero-shot image captioning, outperforming existing CLIP-based methods and achieving state-of-the-art results.
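One way to picture how a linear concept-to-class layer can be derived without training is sketched below. This is an illustration under stated assumptions, not the paper's exact formulation: it assumes the image feature has already been projected into the text-embedding space, that concepts and class names are embedded by the same text encoder, and that cosine similarity is used for both concept activations and class weights. The names `zero_shot_cbm` and `l2_normalize` are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_cbm(projected_feature, concept_emb, class_text_emb):
    f = l2_normalize(projected_feature)   # (text_dim,)
    c = l2_normalize(concept_emb)         # (num_concepts, text_dim)
    t = l2_normalize(class_text_emb)      # (num_classes, text_dim)
    # Bottleneck: one interpretable activation per textual concept.
    concept_activations = c @ f           # (num_concepts,)
    # Zero-shot linear layer: text-to-text similarity between class names
    # and concepts, so no weights are learned from data.
    class_weights = t @ c.T               # (num_classes, num_concepts)
    # Prediction is a linear combination of the concept activations.
    logits = class_weights @ concept_activations
    return concept_activations, class_weights, logits
```

Because the weights come purely from text embeddings, swapping in a different concept bank in this sketch only requires re-embedding the new concepts; nothing is retrained.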
