ComFe: An Interpretable Head for Vision Transformers
Evelyn J. Mannix, Liam Hodgkinson, Howard Bondell
TL;DR
ComFe addresses the interpretability gap in large-scale vision models by introducing a scalable, interpretable-by-design image classification head for frozen Vision Transformers. It clusters patch embeddings into a compact set of image prototypes and matches them to learned class prototypes using a hierarchical von Mises-Fisher likelihood within a transformer decoder framework, so that predictions can be explained through prototype exemplars. The training objective combines discriminative, clustering, and auxiliary losses, and the framework supports background prototypes that separate informative content from background. Evaluations on ImageNet-scale and robustness benchmarks show that the interpretable head achieves competitive accuracy, better generalisability and robustness than linear heads, and efficient training, making ComFe a practical option for interpretable, foundation-model-based vision systems. ComFe also provides visualization and evaluation tools (exemplars, heatmaps) for interpreting predictions and for assessing the interpretability of the underlying ViT embeddings.
Abstract
Interpretable computer vision models explain their classifications by comparing the distances between the local embeddings of an image and a set of prototypes that represent the training data. However, these approaches introduce additional hyperparameters that must be tuned for each new dataset, scale poorly, and are more computationally intensive to train than black-box approaches. In this work, we introduce Component Features (ComFe), a highly scalable interpretable-by-design image classification head for pretrained Vision Transformers (ViTs) that obtains performance competitive with comparable non-interpretable methods. To our knowledge, ComFe is the first interpretable head and, unlike other interpretable approaches, can be readily applied to large-scale datasets such as ImageNet-1K. Additionally, ComFe provides improved robustness and outperforms previous interpretable approaches on key benchmark datasets while using a consistent set of hyperparameters and without finetuning the pretrained ViT backbone. With only global image labels and no segmentation or part annotations, ComFe can identify consistent component features within an image and determine which of these features are informative in making a prediction. Code is available at github.com/emannix/comfe-component-features.
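
To make the prototype-comparison idea concrete, the following is a minimal NumPy sketch of soft-assigning local (patch) embeddings to a set of prototypes. It is an illustration under stated assumptions, not the ComFe implementation: it assumes a single shared von Mises-Fisher concentration `kappa` and uniform prototype priors, in which case the assignment posterior reduces to a softmax over scaled cosine similarities. The function names, the value of `kappa`, and the array shapes (196 patches, 64 dimensions, 5 prototypes) are all hypothetical, and random arrays stand in for real ViT embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit sphere (vMF distributions live there)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def vmf_assignment(embeddings, prototypes, kappa=10.0):
    """Posterior probability of each prototype for each embedding.

    Assumes one shared concentration `kappa` and uniform priors, so the
    vMF normalising constants cancel and the posterior is a softmax over
    kappa * cosine_similarity. Both assumptions are illustrative only.
    """
    z = l2_normalize(embeddings)            # (n_patches, dim)
    mu = l2_normalize(prototypes)           # (n_prototypes, dim)
    logits = kappa * (z @ mu.T)             # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical stand-ins for frozen-backbone patch embeddings and prototypes.
rng = np.random.default_rng(0)
patch_embeddings = rng.normal(size=(196, 64))
prototypes = rng.normal(size=(5, 64))

probs = vmf_assignment(patch_embeddings, prototypes)
nearest_prototype = probs.argmax(axis=1)    # hard assignment per patch
```

Reshaping a column of `probs` back onto the patch grid gives a per-prototype heatmap, which is one simple way such assignments can be visualised.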
