CLIP-Free, Label-Free, Zero-Shot Concept Bottleneck Models

Fawaz Sammani, Jonas Fischer, Nikos Deligiannis

TL;DR

This work removes reliance on CLIP from Concept Bottleneck Models by introducing TextUnlock, a CLIP-free, label-free method to convert any frozen visual classifier into a zero-shot CBM. It aligns the classifier's output distribution with a vision–language counterpart using a trainable MLP that maps visual features into a text-embedding space, preserving the original predictions. The approach enables zero-shot concept discovery and concept-to-class mapping using a large, general concept bank, and extends to zero-shot image captioning with prefix-tuned language generation, achieving state-of-the-art results on ImageNet and several domain datasets. Its data-efficient, architecture-agnostic design preserves the classifier’s reasoning while delivering interpretable concept activations and flexible on-the-fly concept sets.

Abstract

Concept Bottleneck Models (CBMs) map dense, high-dimensional feature representations into a set of human-interpretable concepts, which are then combined linearly to make a prediction. However, modern CBMs rely on the CLIP model to establish a mapping from dense feature representations to textual concepts, and it remains unclear how to design CBMs for models other than CLIP. Methods that do not use CLIP instead require manual, labor-intensive annotation to associate feature representations with concepts. Furthermore, all CBMs necessitate training a linear classifier to map the extracted concepts to class labels. In this work, we lift all three limitations simultaneously by proposing a method that converts any frozen visual classifier into a CBM without requiring image-concept labels (label-free), without relying on the CLIP model (CLIP-free), and by deriving the linear classifier in a zero-shot manner. Our method is formulated by aligning the original classifier's distribution (over discrete class indices) with its corresponding vision-language counterpart distribution derived from textual class names, while preserving the classifier's performance. The approach requires no ground-truth image-class annotations, is highly data-efficient, and preserves the classifier's reasoning process. Applied and tested on over 40 visual classifiers, our resulting CLIP-free, zero-shot CBM sets a new state of the art, surpassing even supervised CLIP-based CBMs. Finally, we show that our method can be used for zero-shot image captioning, outperforming existing CLIP-based methods and achieving state-of-the-art results.
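The alignment objective described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the vision-language counterpart distribution is a softmax over dot products between MLP-mapped visual features and class-name text embeddings, and that the alignment is measured by a cross-entropy against the frozen classifier's own output distribution. All names (`mlp`, `alignment_loss`, the toy embeddings) are hypothetical, and a real setup would minimize this loss over the MLP parameters with a gradient-based optimizer.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear(weights, bias, x):
    # weights is a list of rows; returns weights @ x + bias.
    return [dot(row, x) + b for row, b in zip(weights, bias)]

def mlp(params, x):
    # Hypothetical two-layer MLP (the only trainable part) mapping
    # visual features into the text-embedding space.
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(w1, b1, x)]  # ReLU
    return linear(w2, b2, hidden)

def alignment_loss(original_logits, visual_feature, mlp_params, class_text_embeddings):
    # Target: the frozen classifier's own distribution (no ground-truth labels).
    target = softmax(original_logits)
    # Counterpart: similarity of the mapped feature to each class-name embedding.
    mapped = mlp(mlp_params, visual_feature)
    vl_logits = [dot(mapped, t) for t in class_text_embeddings]
    predicted = softmax(vl_logits)
    # Assumed objective: cross-entropy with soft targets.
    return -sum(p * math.log(q) for p, q in zip(target, predicted))

# Toy example with two classes and 2-d embeddings (made-up values).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
class_text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
original_logits = [2.0, 0.0]
visual_feature = [2.0, 0.0]

aligned_params = (identity, zeros, identity, zeros)
swapped_params = (identity, zeros, [[0.0, 1.0], [1.0, 0.0]], zeros)

loss_aligned = alignment_loss(original_logits, visual_feature,
                              aligned_params, class_text_embeddings)
loss_swapped = alignment_loss(original_logits, visual_feature,
                              swapped_params, class_text_embeddings)
print(loss_aligned < loss_swapped)
```

In this toy case the aligned MLP reproduces the classifier's distribution exactly, so the loss reduces to the entropy of that distribution, while the swapped MLP is penalized for disagreeing with the original prediction.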

Paper Structure

This paper contains 28 sections, 2 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: Overview of our proposed TextUnlock. (a) The process of training the MLP mapping between vision and text space, with pseudocode given in the appendix. (b) The process of inference with the adapted visual classifier. The text encoder acts as a weight generator for a linear classifier. Icons in the figure indicate whether each module is frozen or trainable.
  • Figure 1: Limitations of our method in wrong semantic concept association
  • Figure 2: Building zero-shot CBMs for any pretrained classifier. We first perform (a) concept discovery, followed by (b) building the concepts-to-class classifier in a zero-shot manner, which results in (c) our final CBM. Note that the concept bank only needs to be encoded once.
  • Figure 2: The process used to generate zero-shot captions using any pretrained language decoder (e.g., GPT-2). The process is shown for the first timestep ($ts=1$) and first iteration ($j=1$) with a hard prompt set as "an image of a". We apply prefix tuning while keeping the language decoder frozen, generating text that maximizes the similarity with the visual features.
  • Figure 3: Qualitative examples of our zero-shot CBMs. We show the top-detected concepts, each with its corresponding importance score on the x-axis.
  • ...and 2 more figures
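
The two steps named in the caption on building zero-shot CBMs (concept discovery, then a zero-shot concepts-to-class classifier) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: concept activations are taken to be cosine similarities between the mapped image embedding and concept text embeddings, and the concepts-to-class weights are taken to be cosine similarities between class-name and concept embeddings. Function names and the toy embeddings are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concept_activations(image_embedding, concept_embeddings):
    # (a) Concept discovery: similarity of the image (already mapped into
    # text space) to every concept in the bank.
    img = normalize(image_embedding)
    return [dot(img, normalize(c)) for c in concept_embeddings]

def zero_shot_concept_to_class(concept_embeddings, class_embeddings):
    # (b) Assumed zero-shot linear layer: weights[k][z] is the text-space
    # similarity between class k and concept z, so nothing is trained.
    return [[dot(normalize(k), normalize(c)) for c in concept_embeddings]
            for k in class_embeddings]

def cbm_predict(image_embedding, concept_embeddings, class_embeddings):
    # (c) Final CBM: class scores are linear in the concept activations.
    acts = concept_activations(image_embedding, concept_embeddings)
    weights = zero_shot_concept_to_class(concept_embeddings, class_embeddings)
    scores = [dot(row, acts) for row in weights]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    # Per-concept contribution to the predicted class score.
    contributions = [w * a for w, a in zip(weights[predicted], acts)]
    return predicted, acts, contributions

# Toy 2-d text space (made-up values): concepts and class names.
concept_names = ["stripes", "whiskers"]
concept_embeddings = [[1.0, 0.0], [0.0, 1.0]]   # encoded once, then reused
class_names = ["zebra", "cat"]
class_embeddings = [[0.9, 0.1], [0.1, 0.9]]
image_embedding = [0.8, 0.2]                    # output of the mapping MLP

predicted, acts, contributions = cbm_predict(
    image_embedding, concept_embeddings, class_embeddings)
top_concept = max(range(len(contributions)), key=contributions.__getitem__)
print(class_names[predicted], concept_names[top_concept])
```

Because the concept bank enters only through its text embeddings, swapping in a different concept set would only require re-encoding the new concepts under these assumptions.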