Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao

TL;DR

This paper introduces Prompt-CAM, a lightweight method to render pre-trained Vision Transformers interpretable for fine-grained analysis by learning $C$ class-specific prompts that steer attention toward discriminative traits. By freezing the ViT backbone and training only the prompts and a shared scorer, Prompt-CAM yields class-specific multi-head attention maps that localize traits and explain misclassifications, while remaining easy to implement (akin to Visual Prompt Tuning). Across 13 diverse fine-grained datasets and multiple backbones (DINO, DINOv2, BioCLIP), Prompt-CAM demonstrates strong trait localization, competitive accuracy, and superior faithfulness (via insertion/deletion metrics) compared to post-hoc explainers and some interpretable baselines. The approach supports variants (Shallow vs. Deep prompts), allows trait ranking via greedy head-blurring, and extends to taxonomy-key discovery, highlighting its practical impact for biologically meaningful trait analysis and beyond. Overall, Prompt-CAM provides a simple, scalable, and effective path to interpretable ViTs that emphasizes localized, trait-based reasoning in fine-grained domains.

Abstract

We present a simple approach to make pre-trained Vision Transformers (ViTs) interpretable for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as bird species. Pre-trained ViTs, such as DINO, have demonstrated remarkable capabilities in extracting localized, discriminative features. However, saliency maps like Grad-CAM often fail to identify these traits, producing blurred, coarse heatmaps that highlight entire objects instead. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to address this limitation. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To correctly classify an image, the true-class prompt must attend to unique image patches not present in other classes' images (i.e., traits). As a result, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a "free lunch," requiring only a modification to the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM easy to train and apply, in stark contrast to other interpretable methods that require designing specific models and training processes. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate the superior interpretation capability of Prompt-CAM. The source code and demo are available at https://github.com/Imageomics/Prompt_CAM.
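
The idea of scoring each class from its own prompt output can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `prompt_cam_head`, the single toy attention layer, the omission of learned query/key/value projections, and the random inputs are all assumptions made for illustration. It only shows how $C$ prompt outputs and one shared scoring vector could yield $C$ logits plus per-head attention maps over image patches.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prompt_cam_head(prompts, patches, w_score, num_heads):
    """Toy stand-in for one frozen attention layer plus a shared scorer.

    prompts: (C, D) class-specific prompt tokens (the trainable part)
    patches: (M, D) patch features from the frozen backbone
    w_score: (D,)   scoring vector shared by all classes
    Returns logits of shape (C,) and attention maps of shape (C, H, M).
    Assumption: Q/K/V projections are dropped and values equal keys.
    """
    C, D = prompts.shape
    M = patches.shape[0]
    d = D // num_heads
    q = prompts.reshape(C, num_heads, d).transpose(1, 0, 2)  # (H, C, d)
    k = patches.reshape(M, num_heads, d).transpose(1, 0, 2)  # (H, M, d)
    # Each prompt queries the patches separately in every head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)  # (H, C, M)
    out = (attn @ k).transpose(1, 0, 2).reshape(C, D)  # (C, D) prompt outputs
    logits = out @ w_score  # one logit per class prompt
    return logits, attn.transpose(1, 0, 2)

rng = np.random.default_rng(0)
C, M, D, H = 5, 16, 32, 4  # hypothetical sizes
logits, attn = prompt_cam_head(
    rng.normal(size=(C, D)), rng.normal(size=(M, D)), rng.normal(size=D), H
)
pred = int(np.argmax(logits))
maps = attn[pred]  # (H, M): one attention map per head for the predicted class
```

In this sketch, `maps` plays the role of the class-specific multi-head attention maps: each row could be reshaped to the patch grid and overlaid on the image.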

Paper Structure (21 sections, 10 equations, 22 figures, 5 tables)

Figures (22)

  • Figure 1: Illustration of Prompt-CAM. By learning class-specific prompts for a pre-trained Vision Transformer (ViT), Prompt-CAM enables multiple functionalities. (a) Prompt-CAM achieves fine-grained image classification using the output logits from the class-specific prompts. (b) Prompt-CAM enables trait localization by visualizing the multi-head attention maps queried by the true-class prompt. (c) Prompt-CAM identifies common traits shared between species by visualizing the attention maps queried by another class's prompt. (d) Prompt-CAM can identify the most discriminative traits per species (e.g., distinctive yellow chest and black neck for "Scott Oriole") by systematically masking out the least important attention heads. See the visualization section of the paper for details.
  • Figure 2: Prompt-CAM vs. Visual Prompt Tuning (VPT). (a) VPT (Jia et al., 2022) adds the prediction head on top of the [CLS] token's output, a default design to use ViTs for classification. (b) Prompt-CAM adds the prediction head on top of the injected prompts' outputs, making them class-specific to identify and localize traits.
  • Figure 3: Overview of Prompt Class Attention Map (Prompt-CAM). We explore two variants, given a pre-trained ViT with $N$ layers and a downstream task with $C$ classes: (a) Prompt-CAM-Deep: insert $C$ learnable "class-specific" tokens to the last layer's input and $C$ learnable "class-agnostic" tokens to each of the other $N-1$ layers' input; (b) Prompt-CAM-Shallow: insert $C$ learnable "class-specific" tokens to the first layer's input. During training, only the prompts and the prediction head are updated; the whole ViT is frozen.
  • Figure 4: Visualization of Prompt-CAM on different datasets. We show the top four attention maps (from left to right) per correctly classified test example triggered by the ground-truth classes.
  • Figure 5: Images misclassified by Prompt-CAM but correctly classified by Linear Probing. Species-specific traits, such as the red breast of "Red-breasted Grosbeak", are barely visible in the misclassified images, while Linear Probing uses global features such as body shapes, poses, and backgrounds for correct predictions.
  • ...and 17 more figures
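
The head-masking idea in Figure 1(d) can be sketched as a greedy ranking loop. This is a minimal illustration under stated assumptions, not the paper's procedure: `rank_heads_greedy` and the `score` callback are hypothetical names, the loop ranks every head rather than stopping when the prediction changes, and the toy `score` below replaces what would in practice be a forward pass with the inactive heads masked or blurred.

```python
from typing import Callable, FrozenSet, List

def rank_heads_greedy(
    num_heads: int, score: Callable[[FrozenSet[int]], float]
) -> List[int]:
    """Return head indices ordered from least to most important.

    score(active) is a hypothetical callback giving the true-class score
    when only the heads in `active` are left unmasked.
    """
    active = set(range(num_heads))
    removed: List[int] = []
    while len(active) > 1:
        # Drop the head whose removal hurts the true-class score the least.
        least = max(sorted(active), key=lambda h: score(frozenset(active - {h})))
        active.remove(least)
        removed.append(least)
    removed.extend(active)  # the last surviving head is the most important
    return removed

# Toy scorer (assumption): each head contributes a fixed amount to the score.
head_weight = [0.1, 0.9, 0.5, 0.3]

def toy_score(active: FrozenSet[int]) -> float:
    return sum(head_weight[h] for h in active)

ranking = rank_heads_greedy(len(head_weight), toy_score)
print(ranking)  # [0, 3, 2, 1]
```

With a real model, the heads at the end of `ranking` would be the ones whose attention maps are the best candidates for the most discriminative traits.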