Evolving Interpretable Visual Classifiers with Large Language Models

Mia Chiquier, Utkarsh Mall, Carl Vondrick

TL;DR

The paper addresses the interpretability gap in vision-language classifiers by learning a discrete, human-interpretable set of attributes for each class. An evolutionary framework has an open LLM propose mutations to the attribute sets, guided by the scores of past attributes. A concept bottleneck model aggregates per-attribute scores from a CLIP-based scorer, enabling open-vocabulary classification without class-name priors. On five fine-grained iNaturalist datasets and two KikiBouba datasets of novel concepts, the method outperforms the latest baselines by 18.4% and 22.2%, respectively, and its interpretable attributes provide a way to audit dataset bias. The approach offers practical benefits for trust, explainability, and bias analysis in specialized domains; its main potential limitation is that it inherits the biases of the underlying LLMs.
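The concept-bottleneck scoring described in the TL;DR can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the two-dimensional toy embeddings stand in for CLIP image and text embeddings, the anonymous class ids and the helper names (`cosine`, `class_scores`, `predict`) are hypothetical, and averaging the per-attribute similarities is an assumed aggregation rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_scores(image_emb, attribute_embs_per_class):
    """Score each class as the mean image-attribute similarity.

    Mean aggregation is an assumption made for this sketch.
    """
    return {
        cls: sum(cosine(image_emb, emb) for emb in embs) / len(embs)
        for cls, embs in attribute_embs_per_class.items()
    }


def predict(image_emb, attribute_embs_per_class):
    """Return the class id whose attribute set best matches the image."""
    scores = class_scores(image_emb, attribute_embs_per_class)
    return max(scores, key=scores.get)


# Toy data: classes are anonymous ids, since no class names are assumed.
toy_image = [1.0, 0.0]
toy_attributes = {
    "class_0": [[1.0, 0.0], [1.0, 1.0]],
    "class_1": [[0.0, 1.0], [1.0, 1.0]],
}

print(predict(toy_image, toy_attributes))
```

Because each class score is built from individual attribute similarities, a prediction can be traced back to the attributes that contributed most to it.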

Abstract

Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance. However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, a risk of bias, and an inability to discover new visual concepts that have not been written down. Moreover, in practical settings, the vocabulary for class names and attributes of specialized concepts will not be known, preventing these methods from performing well on images uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. We introduce an evolutionary search algorithm that uses a large language model and its in-context learning abilities to iteratively mutate a concept bottleneck of attributes for classification. Our method produces state-of-the-art, interpretable fine-grained classifiers. We outperform the latest baselines by 18.4% on five fine-grained iNaturalist datasets and by 22.2% on two KikiBouba datasets, despite the baselines having access to privileged information about class names.
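The evolutionary search described in the abstract can be sketched as a simple loop. This is only a toy under stated assumptions: `stub_llm_mutate` is a hypothetical stand-in for prompting a large language model with past attribute sets and their scores, `toy_fitness` replaces classification accuracy on training images, and the attribute vocabulary is invented for illustration.

```python
import random

# Invented attribute vocabulary, for illustration only.
VOCAB = [
    "red bark", "smooth leaves", "round shape",
    "spiky outline", "urn-shaped flowers", "pale green leaves",
]


def toy_fitness(attributes):
    """Stand-in for training accuracy: fraction of 'useful' attributes."""
    useful = {"red bark", "urn-shaped flowers", "pale green leaves"}
    return len(useful & set(attributes)) / len(attributes)


def stub_llm_mutate(history, rng, set_size):
    """Hypothetical replacement for the LLM mutation step.

    A real system would place past (attributes, score) pairs in the LLM
    prompt; here we copy the best set so far and swap one attribute.
    """
    best_attrs, _ = max(history, key=lambda item: item[1])
    child = list(best_attrs)
    candidates = [a for a in VOCAB if a not in child]
    child[rng.randrange(set_size)] = rng.choice(candidates)
    return child


def evolve(generations=30, set_size=3, seed=0):
    """Run the toy search; return the best (attributes, score) and history."""
    rng = random.Random(seed)
    first = rng.sample(VOCAB, set_size)
    history = [(first, toy_fitness(first))]
    for _ in range(generations):
        child = stub_llm_mutate(history, rng, set_size)
        history.append((child, toy_fitness(child)))
    return max(history, key=lambda item: item[1]), history


best, history = evolve()
print(best)
```

The history of scored attribute sets plays the role of the in-context examples: each new proposal is conditioned on what has already been tried and how well it worked.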

Paper Structure

This paper contains 15 sections, 4 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Learning Interpretable Classifiers. Can we find text attributes for a concept by looking at the images without their class names? LLM-Mutate is a framework that learns sets of maximally discriminative visual attributes per class without access to class names or any form of prior knowledge.
  • Figure 2: Method. LLM-Mutate is an evolutionary algorithm that learns sets of discrete language attributes per class. The mutation and cross-over operations, which are mechanisms to introduce new parameter hypotheses, are replaced by a large language model that uses in-context learning over past attributes and their scores to iteratively generate better attributes.
  • Figure 3: Attribute evolution. We show examples of the attribute evolution for both the pre-training and joint-training stages of learning. At the beginning, the first generated set of attributes has little to do with the class; by the end of joint training, the learned attributes are specific to the Greenleaf Manzanita.
  • Figure 4: Qualitative Results. We show qualitative results for the iNaturalist Lichen family and a KikiBouba dataset. The results illustrate two sample images per class and the learned attributes. The learned attributes for the Lichen hardly refer to color, as this is a feature common to all Lichen, and instead focus on structural properties.
  • Figure 5: Predictions. We show three prediction examples. For each example, we show our method's prediction (first column), the prediction of classification by description (CBD; Menon and Vondrick, 2022) (second column), and CBD's attributes for the ground-truth class (third column). For each column, we show the normalized probability per attribute. Below the input image, we show the probability distributions across classes for both our method and CBD. The results show that our learned attributes are more detailed and more discriminative of the species within the family than those of the CBD baseline. Furthermore, our method's class probability distributions tend to be more concentrated than CBD's.
  • ...and 3 more figures
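
The Figure 5 caption refers to probability distributions across classes and to how concentrated they are. One way to make that notion concrete is sketched below, assuming class scores are turned into probabilities with a softmax and that concentration is measured by Shannon entropy. Both choices, and the `temperature` parameter, are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores, temperature=1.0):
    """Convert raw class scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more concentrated distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Made-up scores: one class clearly preferred vs. no preference at all.
peaked = softmax([0.9, 0.2, 0.1], temperature=0.1)
flat = softmax([0.5, 0.5, 0.5])

print(entropy(peaked) < entropy(flat))
```

Under this measure, a classifier that puts most of its probability mass on one class has lower entropy than one that spreads it evenly.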