PEEB: Part-based Image Classifiers with an Explainable and Editable Language Bottleneck

Thang M. Pham, Peijie Chen, Tin Nguyen, Seunghyun Yoon, Trung Bui, Anh Totti Nguyen

TL;DR

PEEB tackles fine-grained classification with an explainable, editable, part-based bottleneck that grounds textual part descriptors to detected image parts using OWL-ViT. By removing reliance on class names in prompts and enabling descriptor editing, it achieves strong generalized zero-shot performance and competitive supervised results, outperforming CLIP-based and descriptor-only approaches in GZSL and ZSL settings. The approach employs two-stage contrastive pretraining on Bird-11K followed by finetuning on downstream tasks, and combines an open-vocabulary detector with a ground-truth-like part-to-descriptor matching that yields interpretable predictions. This work also provides the large-scale Bird-11K and Dog-140 datasets and demonstrates transferability to dog identification, highlighting practical impact for scalable, interactive, fine-grained recognition across domains.

Abstract

CLIP-based classifiers rely on the prompt containing a {class name} that is known to the text encoder. Therefore, they perform poorly on new classes or classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB, an explainable and editable classifier that (1) expresses each class name as a set of text descriptors that describe the visual parts of that class; and (2) matches the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a large margin (~10x in top-1 accuracy). Compared to part-based classifiers, PEEB is not only the state of the art (SOTA) in the supervised-learning setting (88.80% and 92.20% accuracy on CUB-200 and Dogs-120, respectively) but also the first to let users edit the text descriptors to form a new classifier without any re-training. Compared to concept bottleneck models, PEEB is also SOTA in both zero-shot and supervised-learning settings.
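
The matching in step (2), and the descriptor editing that needs no re-training, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that a class logit is the sum of per-part dot products between part embeddings and descriptor embeddings; the function names, class names, and toy 2-D embeddings are all hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def class_logits(part_embs, class_descs):
    """Assumed scoring rule: sum the part-to-descriptor dot products per class.

    part_embs:   list of P part embeddings (one vector per detected part)
    class_descs: dict mapping class name -> list of P descriptor embeddings
    """
    return {
        name: sum(dot(p, d) for p, d in zip(part_embs, descs))
        for name, descs in class_descs.items()
    }

def softmax(scores):
    """Numerically stable softmax over a dict of logits."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def predict(part_embs, class_descs):
    """Return the top-1 class and the softmax distribution."""
    probs = softmax(class_logits(part_embs, class_descs))
    return max(probs, key=probs.get), probs

# Toy data (hypothetical): two parts, 2-D embeddings.
parts = [[1.0, 0.0], [0.0, 1.0]]  # e.g. throat region, wings region
classes = {
    "class_a": [[0.2, 0.9], [0.1, 0.8]],
    "class_b": [[0.1, 0.9], [0.9, 0.1]],
}
before, _ = predict(parts, classes)

# Editing: clone class_a's descriptors, change only the throat descriptor,
# and register the result as a new class. No weights are updated.
edited = dict(classes)
new_descs = [list(d) for d in classes["class_a"]]
new_descs[0] = [0.9, 0.1]
edited["class_c"] = new_descs
after, probs = predict(parts, edited)

print(before, after)
```

In this toy setup the prediction moves from the cloned class to the newly added one purely because one descriptor embedding changed. In the real system the descriptor vectors would come from a text encoder rather than being written by hand.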

Paper Structure (82 sections, 10 equations, 20 figures, 19 tables, 1 algorithm)

Figures (20)

  • Figure 1: Existing explanations are either (a) textual but at the image level; or (b) part-level but not textual. Combining the best of both worlds, PEEB (c) first matches each detected object part to a text descriptor, then uses the part-level matching scores to classify the image.
  • Figure 2: Given an input image (a) from an unseen class of Eastern Bluebird, PEEB misclassifies it as Indigo Bunting (b), a visually similar blue bird in CUB-200 (d). To add a new class for Eastern Bluebird to the 200-class list that PEEB considers when classifying, we clone the 12 textual descriptors of Indigo Bunting (b) and edit (dashed arrow) the descriptors of the throat and wings (c) to reflect their identification features described on AllAboutBirds.org ("Male Eastern Bluebirds are vivid, deep blue above and rusty or brick-red on the throat and breast"). After the edit, PEEB correctly classifies the input image as Eastern Bluebird (softmax: 0.0445) out of 201 classes (c). That is, the dot product between the wings text descriptor and the same orange region increases from 0.57 to 0.74.
  • Figure 3: During inference, 12 visual part embeddings with the highest cosine similarity with encoded part names are selected (a). These visual part embeddings are then mapped ($\longrightarrow$) to bounding boxes via Box MLP. Simultaneously, the same embeddings are forwarded to the Part MLP and its outputs are then matched (b) with textual part descriptors to make classification predictions ($\longrightarrow$). \ref{fig:xclip_overview} shows a more detailed view of the same process.
  • Figure 4: With original descriptors, M&V [menon2023visual] correctly classifies the input image as Blue Jay (a). Yet, interestingly, when randomly swapping the descriptors of this class with those of other classes (b), M&V's top-1 prediction remains unchanged, suggesting that the class names in the prompt (e.g., "A photo of {class name}") have the most influence over the prediction (not the expressive descriptors). In contrast, PEEB changes its top-1 prediction from Blue Jay (c) to Least Tern (d) when the descriptors are randomized.
  • Figure 5: PEEB classifies this Dogs-120 image as Alaskan Malamute (softmax: 0.199) due to the matching between the image regions and associated textual part descriptors. In contrast, the explanation shows that the input image is not classified as Cairn Terrier mostly because its ears and body regions do not match the text descriptors, i.e., dot products are 0.000 and 0.000, respectively. See \ref{sec:qualitative_examples} for more qualitative examples.
  • ...and 15 more figures
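
The part-selection step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it assumes one candidate visual embedding is picked per part name by taking the highest cosine similarity, and the helper names and toy vectors are hypothetical. The Box MLP and Part MLP stages are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    num = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return num / (nu * nv)

def select_part_embeddings(candidates, part_name_embs):
    """For each part name, return the index of the candidate visual
    embedding with the highest cosine similarity to the encoded name.

    candidates:     list of N visual embeddings from the detector
    part_name_embs: dict mapping part name -> encoded part-name vector
    """
    selected = {}
    for name, text_emb in part_name_embs.items():
        selected[name] = max(
            range(len(candidates)),
            key=lambda i: cosine(candidates[i], text_emb),
        )
    return selected

# Toy data (hypothetical): three candidate embeddings, two part names.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
part_names = {"beak": [2.0, 0.1], "tail": [0.1, 3.0]}

chosen = select_part_embeddings(candidates, part_names)
print(chosen)
```

With the 12 part names from the caption, the same loop would yield 12 selected embeddings, which would then feed the box and matching heads.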