ProtoS-ViT: Visual foundation models for sparse self-explainable classifications

Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis

TL;DR

The paper addresses explainability gaps in prototypical-part networks by introducing a rigorous evaluation framework and identifying shortcomings in existing methods. It presents ProtoS-ViT, a self-explainable classifier that freezes a Vision Transformer backbone and learns a compact set of $J$ prototypes $p_j$, each of dimension $D$. Patch-level embeddings $g_i$ are compared with the prototypes through cosine similarities $S_{i,j}=\cos(g_i, p_j)$. A novel prototypical head computes per-prototype scores $h_j$ via depthwise convolutions with independent kernels and multi-scale paths, and feeds them to a positive-weight linear classifier that produces the class predictions. Across eight general datasets and biomedical tasks, ProtoS-ViT achieves competitive accuracy while improving explanation metrics such as correctness, completeness, consistency, and contrastivity. These gains are aided by the Hoyer-Square sparsity loss and the tanh loss, and are validated with ablations and user studies.
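The similarity and sparsity terms named in the summary can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy vectors, and the `eps` guard are assumptions made for illustration, and the Hoyer-Square measure is written in its usual form, $(\sum_k |x_k|)^2 / \sum_k x_k^2$, without claiming which tensor the paper applies it to.

```python
import math


def cosine_similarity_map(patches, prototypes, eps=1e-12):
    """Return S with S[i][j] = cos(g_i, p_j) for patch embeddings g_i and prototypes p_j."""
    def unit(vector):
        # Scale a vector to unit L2 norm; eps guards against division by zero.
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / max(norm, eps) for x in vector]

    g = [unit(v) for v in patches]
    p = [unit(v) for v in prototypes]
    # The dot product of unit vectors is the cosine of the angle between them.
    return [[sum(a * b for a, b in zip(gi, pj)) for pj in p] for gi in g]


def hoyer_square(values, eps=1e-12):
    """Hoyer-Square sparsity measure: (L1 norm)^2 / (L2 norm)^2.

    It equals 1 when a single entry is non-zero and n when all n entries
    have the same magnitude, so minimising it encourages sparse vectors.
    """
    l1 = sum(abs(v) for v in values)
    l2_squared = sum(v * v for v in values)
    return (l1 * l1) / max(l2_squared, eps)


# Toy example (hypothetical values): 3 patches, J = 2 prototypes, D = 2.
patches = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
prototypes = [[3.0, 0.0], [0.0, 1.0]]
S = cosine_similarity_map(patches, prototypes)
```

In this toy case the first patch is aligned with the first prototype and orthogonal to the second, while the third patch sits halfway between the two. A one-hot vector reaches the lowest Hoyer-Square value, which is why the measure is a natural fit for encouraging compact explanations.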

Abstract

Prototypical networks aim to build intrinsically explainable models based on the linear summation of concepts. Concepts are coherent entities that we, as humans, can recognize and associate with a certain object or entity. However, important challenges remain in the fair evaluation of explanation quality provided by these models. This work first proposes an extensive set of quantitative and qualitative metrics that make it possible to identify drawbacks in current prototypical networks. It then introduces a novel architecture which provides compact explanations, outperforming current prototypical models in terms of explanation quality. Overall, the proposed architecture demonstrates how frozen pre-trained ViT backbones can be effectively turned into prototypical models for both general and domain-specific tasks, in our case biomedical image classifiers. Code is available at https://github.com/hturbe/protosvit.

Paper Structure (22 sections, 7 equations, 11 figures, 12 tables)

Figures (11)

  • Figure 1: Model architecture. The grey box depicts the similarity head. The pink box indicates the operations forming the prototypical head. Transparency of the elements aims to reflect the model's sparsity. Bottom: similarity maps interpolated from the similarity head.
  • Figure 2: Radar plot summarizing model performance in terms of Accuracy (Acc.) as well as explanation quality, measured with the following metrics: Global Size (Glob. Size), Local Size (Loc. Size), Completeness (Compl.), Correctness (Correct.), Contrastivity (Contrast.), Consistency (Consist.), and Stability (Stabil.).
  • Figure 3: Score sheet for predictions on two random samples of the CUB dataset. Each row shows a prediction on a different sample. The first column indicates the position of the top four prototypes. Each subsequent column shows a prototype along with its importance towards the predicted class. Above the first column, we present the total score for the predicted class as well as how much of this score is explained by the prototypes shown in the figure.
  • Figure 4: Score sheet for predictions on three random samples of the CARS (a) and PETS (b) datasets. Each row shows a prediction on a different sample. The first column indicates the position of the top four prototypes. Each subsequent column shows a prototype along with its importance towards the predicted class. Above the first column, we present the total score for the predicted class as well as how much of this score is explained by the prototypes shown in the figure.
  • Figure 5: Random sample with activation for a prototype associated with the absence (a) and presence (b) of pneumonia.
  • ...and 6 more figures