ComFe: An Interpretable Head for Vision Transformers

Evelyn J. Mannix, Liam Hodgkinson, Howard Bondell

TL;DR

ComFe addresses the interpretability gap in large-scale vision models by introducing a scalable, interpretable-by-design image classification head for frozen Vision Transformers. It clusters patch embeddings into a compact set of image prototypes and matches them to learned class prototypes using a hierarchical von Mises-Fisher likelihood within a transformer decoder framework, so that each prediction can be explained through prototype exemplars. The training objective combines discriminative, clustering, and auxiliary losses, and the framework supports background prototypes that separate informative content from background. Evaluations on ImageNet-scale and robustness benchmarks show accuracy competitive with non-interpretable heads, improved generalisability and robustness over linear heads, and efficient training, making ComFe a practical option for interpretable vision systems built on foundation models. It also provides visualisation and evaluation tools (exemplars, heatmaps) for interpreting predictions and for assessing how interpretable the underlying ViT embeddings are.

Abstract

Interpretable computer vision models explain their classifications by comparing the distances between the local embeddings of an image and a set of prototypes that represent the training data. However, these approaches introduce additional hyperparameters that need to be tuned for each new dataset, scale poorly, and are more computationally intensive to train than black-box approaches. In this work, we introduce Component Features (ComFe), a highly scalable interpretable-by-design image classification head for pretrained Vision Transformers (ViTs) that can obtain performance competitive with comparable non-interpretable methods. To our knowledge, ComFe is the first interpretable head and, unlike other interpretable approaches, can be readily applied to large-scale datasets such as ImageNet-1K. Additionally, ComFe provides improved robustness and outperforms previous interpretable approaches on key benchmark datasets while using a consistent set of hyperparameters and without finetuning the pretrained ViT backbone. With only global image labels and no segmentation or part annotations, ComFe can identify consistent component features within an image and determine which of these features are informative in making a prediction. Code is available at github.com/emannix/comfe-component-features.
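
The matching step between image prototypes and class prototypes can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes unit-normalised embeddings and a single shared von Mises-Fisher concentration `kappa`, in which case the likelihood of an image prototype under a class prototype is proportional to `exp(kappa * cosine_similarity)`. The function name `classify_from_prototypes`, the value of `kappa`, and the averaging used to obtain an image-level prediction are all hypothetical choices made for illustration; the paper's hierarchical likelihood and its background prototypes are not reproduced here.

```python
import numpy as np


def l2_normalise(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit sphere (vMF distributions live there)."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def classify_from_prototypes(P, C, class_of_prototype, n_classes, kappa=10.0):
    """Match image prototypes to class prototypes (illustrative only).

    P: (n_image_prototypes, d) image prototypes for one image.
    C: (n_class_prototypes, d) learned class prototypes.
    class_of_prototype: (n_class_prototypes,) integer label of each class prototype.
    kappa: assumed shared vMF concentration (hypothetical value).
    """
    P, C = l2_normalise(P), l2_normalise(C)
    cosine = P @ C.T                              # (n_image, n_class_protos)
    # With a shared concentration, the vMF posterior over class prototypes
    # is a softmax over kappa * cosine similarity.
    responsibility = softmax(kappa * cosine, axis=1)
    one_hot = np.eye(n_classes)[class_of_prototype]
    per_prototype = responsibility @ one_hot      # (n_image, n_classes)
    # Assumption: average over image prototypes for an image-level score.
    return per_prototype, per_prototype.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.normal(size=(5, 16))      # 5 image prototypes
    C = rng.normal(size=(6, 16))      # 6 class prototypes, 2 per class
    labels = np.array([0, 0, 1, 1, 2, 2])
    per_proto, image_probs = classify_from_prototypes(P, C, labels, n_classes=3)
    print(per_proto.shape, image_probs)
```

Because each image prototype receives its own class distribution, the class prototypes with the highest responsibility can be traced back to training exemplars, which is the kind of explanation the abstract refers to.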

Paper Structure (55 sections, 26 equations, 14 figures, 14 tables, 1 algorithm)

Figures (14)

  • Figure 1: Illustration of ComFe. The image is first clustered into component features, which are then compared to class prototypes. This comparison between component features and class prototypes is used to identify the salient parts of the image for predicting the class Osteospermum, as shown by the class confidence heatmap in the final image.
  • Figure 2: The ComFe framework. A pretrained ViT network is applied to an input image $\mathbf{X}$ to produce patch embeddings $\mathbf{Z}$ and a class token. The ComFe clustering head $g_\theta$ is trained to cluster patch embeddings and produce image prototypes $\mathbf{P}$, which are compared to a set of learnt class prototypes $\mathbf{C}$ that represent the training data, in order to identify the informative image regions and make a prediction. In contrast, a standard non-interpretable approach trains a linear head to make a prediction using the class token, and does not provide insight into the informative image regions or which training images led to a particular prediction.
  • Figure 3: Summary of the ComFe framework. Given an input image $\mathbf{X}$, the patch embeddings $\mathbf{Z}$ are obtained using a pretrained ViT backbone model, $f(\mathbf{X}) = \mathbf{Z}$. These patches are clustered into component features using a set of image prototypes $\mathbf{P}$, which are obtained from a transformer decoder clustering head as described in \ref{eq:prototype_clustering_head_gen}. To produce a classification, the image prototypes $\mathbf{P}$ are compared to the class prototypes $\mathbf{C}$, as described in \ref{eq:y_given_Z_mt} and \ref{eq:y_given_z_pred}. The variable $\nu$ represents the class prediction of a local patch, which is unknown because segmentation labels are assumed to be unavailable. The component features map (image prototypes) is visualised by assigning each patch to the image prototype in $\mathbf{P}$ with the greatest likelihood of generating it, as given in \ref{eq:P_given_Z}.
  • Figure 4: Visualizing ComFe explanations. Example ComFe predictions showcasing explainability, sampled from the validation images from the FGVC Aircraft, Stanford Cars and CUB200 datasets. The rows show the input images, component features (image prototypes), class prototype similarity and exemplars, and the class confidence heatmap for the final classification.
  • Figure S5: Class prototype exemplars. Image prototypes from the training data with the smallest cosine distance to the class prototypes associated with each label in the FGVC Aircraft, Stanford Cars and CUB200 datasets.
  • ...and 9 more figures
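
The component features map and class confidence heatmap described in the captions for Figures 1, 3 and 4 can be approximated with a short NumPy sketch. It is an illustration under stated assumptions rather than the paper's code: each patch is assigned to the image prototype with the highest cosine similarity, which coincides with the maximum-likelihood assignment only if all prototypes share one von Mises-Fisher concentration. The function names, the `grid_hw` argument, and the per-prototype confidence vector `proto_confidence` are hypothetical.

```python
import numpy as np


def component_feature_map(Z, P, grid_hw):
    """Assign every patch to its most similar image prototype.

    Z: (n_patches, d) patch embeddings from the frozen backbone.
    P: (n_image_prototypes, d) image prototypes for the same image.
    grid_hw: (height, width) of the patch grid, with height * width == n_patches.
    Returns an integer map of shape (height, width) holding prototype indices.
    """
    h, w = grid_hw
    if Z.shape[0] != h * w:
        raise ValueError("grid_hw does not match the number of patches")
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    cosine = Zn @ Pn.T                     # (n_patches, n_image_prototypes)
    return cosine.argmax(axis=1).reshape(h, w)


def class_confidence_heatmap(assignment, proto_confidence):
    """Paint each patch with the confidence of the prototype it belongs to.

    assignment: (height, width) integer map from component_feature_map.
    proto_confidence: (n_image_prototypes,) confidence that each image
        prototype belongs to the predicted class (assumed to be given).
    """
    return np.asarray(proto_confidence)[assignment]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(16 * 16, 32))     # a 16x16 patch grid
    P = rng.normal(size=(5, 32))           # 5 image prototypes
    segments = component_feature_map(Z, P, (16, 16))
    heat = class_confidence_heatmap(segments, rng.uniform(size=5))
    print(segments.shape, heat.shape)
```

The integer map can be rendered with any categorical colour map to show the component features, and the heatmap can be upsampled to the input resolution and overlaid on the image.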