Tell me why: Visual foundation models as self-explainable classifiers

Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis

TL;DR

ProtoFM tackles the interpretability bottleneck of visual foundation models by freezing a powerful VFM and training a lightweight prototypical head (~1M parameters). It introduces a student-teacher prototype scheme, cosine-based prototype matching, and a multi-task loss (assignment, alignment, contrastive, sparsity, and classification) to produce faithful, localized explanations. Through FunnyBirds-based evaluation and a wide ablation study, ProtoFM demonstrates state-of-the-art interpretability among prototypical-part models while maintaining competitive classification performance on CUB and CARS, and reasonable results on domain-specific data like RSNA. The approach offers practical benefits for deploying interpretable vision systems with limited trainable parameters and paves the way for richer, multi-modal explanations including textual descriptions.

Abstract

Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this sense, self-explainable models (SEM) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at https://github.com/hturbe/proto-fm.
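
To make the "weighted sum of interpretable concepts" idea concrete, the following is a minimal sketch and not the authors' implementation (see the repository linked above for that). It assumes that patch embeddings from a frozen backbone are compared to learned prototypes by cosine similarity, that each prototype's score is its maximum similarity over patches, and that class scores are a linear combination of those prototype scores. The max pooling, the toy vectors, the weight values, and the function names `cosine`, `prototype_scores`, and `classify` are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def prototype_scores(patches, prototypes):
    """For each prototype, return its best cosine similarity over all
    patches and the index of the patch where that best match occurs."""
    scores, locations = [], []
    for proto in prototypes:
        sims = [cosine(patch, proto) for patch in patches]
        best = max(range(len(sims)), key=sims.__getitem__)
        scores.append(sims[best])
        locations.append(best)
    return scores, locations

def classify(scores, weights):
    """weights[c][k] is the (assumed) weight of prototype k for class c.
    Returns class scores and the per-prototype contributions to each."""
    contributions = [[w * s for w, s in zip(row, scores)] for row in weights]
    logits = [sum(row) for row in contributions]
    return logits, contributions

# Toy stand-ins for frozen-backbone patch embeddings (3 patches, 2 dims).
patches = [[1.0, 0.0], [1.0, 1.0], [0.2, 1.0]]
# Two hypothetical learned prototypes.
prototypes = [[2.0, 0.0], [0.0, 3.0]]
# Hypothetical class-by-prototype weights (2 classes, 2 prototypes).
weights = [[1.0, 0.5], [0.0, 2.0]]

scores, locations = prototype_scores(patches, prototypes)
logits, contributions = classify(scores, weights)
predicted = max(range(len(logits)), key=logits.__getitem__)

print("best patch per prototype:", locations)
print("predicted class:", predicted)
print("contributions to predicted class:", contributions[predicted])
```

Because each class score is exactly the sum of its per-prototype contributions, the explanation (which prototype fired, where, and with what weight) accounts for the entire prediction under these assumptions.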

Paper Structure

This paper contains 22 sections, 17 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Model architecture. The model is composed of a frozen VFM followed by a projector and classification head in order to classify images from a set of learned concepts.
  • Figure 2: Score sheet for predictions on three random samples of the CUB dataset. Each row shows a prediction on a different sample. The first column indicates the position of the top four prototypes. Each subsequent column shows a prototype along with its importance towards the predicted class. The total score for the predicted class and the SEC metric are presented above the first column.
  • Figure 3: Nearest patches to four prototypes: two for the CARS dataset (orange and blue boxes) and two for CUB (yellow and green boxes). The predicted class and the maximum similarity between the prototype of interest and the patches are indicated above each image.
  • Figure 4: Radar plot summarizing model performance in terms of accuracy (Acc.) and explanation quality, measured with the following metrics: Global Size (Glob. Size), Local Size (Loc. Size), Completeness (Compl.), Correctness (Correct.), Contrastivity (Contrast.), Consistency (Consist.), and Stability (Stabil.).
  • Figure 5: For each prototype, a random selection of 50 samples in which that prototype contributes to the model's prediction.
  • ...and 5 more figures