ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

Mengqi Xue; Qihan Huang; Haofei Zhang; Jingwen Hu; Jie Song; Mingli Song; Canghong Jin

TL;DR

This paper addresses prototype distraction when transferring ProtoPNet to vision transformers (ViTs) by introducing ProtoPFormer, a dual-branch architecture with global prototypes on the class token and local prototypes on image tokens. It uses a foreground-preserving (FP) mask derived from attention rollout to concentrate local prototypes on foreground regions and a prototypical-part concentration (PPC) loss to enforce diverse, centralized prototypical parts, with final decisions made by combining global and local predictions. Empirical results on CUB, Dogs, and Cars across multiple ViT backbones show superior accuracy and clearer visual explanations compared with SOTA prototype-based baselines, aided by mutual correction between branches. The approach yields interpretable, faithful reasoning from both holistic and part-based perspectives, advancing the practical deployment of prototype-based XAI in ViT-based image recognition.
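The joint decision described above can be illustrated with a small sketch. It assumes each branch produces per-class logits and that the two are combined by plain summation; the function name `joint_decision` and the example numbers are hypothetical and are not taken from the released code.

```python
import numpy as np

def joint_decision(global_logits, local_logits):
    """Combine per-class logits from the two branches by summation (assumed rule)."""
    combined = np.asarray(global_logits) + np.asarray(local_logits)
    return int(np.argmax(combined)), combined

# Hypothetical logits for three classes.
global_logits = np.array([2.0, 1.0, 0.5])  # global branch alone prefers class 0
local_logits = np.array([0.2, 1.5, 0.1])   # local branch alone prefers class 1
predicted, combined = joint_decision(global_logits, local_logits)
```

In this toy case the two branches disagree and the summed logits settle on class 1, which is one way to picture the mutual correction between branches.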

Abstract

Prototypical part network (ProtoPNet) has drawn wide attention and inspired many follow-up studies because of its self-explanatory property for explainable artificial intelligence (XAI). However, when ProtoPNet is applied directly to vision transformer (ViT) backbones, the learned prototypes suffer from a "distraction" problem: they have a relatively high probability of being activated by the background and pay less attention to the foreground. The transformer's powerful capability for modeling long-range dependencies makes it hard for a transformer-based ProtoPNet to focus on prototypical parts, which severely impairs its inherent interpretability. This paper proposes the prototypical part transformer (ProtoPFormer) for appropriately and effectively applying prototype-based methods to ViTs for interpretable image recognition. The proposed method introduces global and local prototypes that capture and highlight the representative holistic and partial features of targets, in line with the architectural characteristics of ViTs. The global prototypes provide a global view of objects that guides the local prototypes to concentrate on the foreground while eliminating the influence of the background. The local prototypes are then explicitly supervised to concentrate on their respective prototypical visual parts, increasing the overall interpretability. Extensive experiments demonstrate that the proposed global and local prototypes can mutually correct each other and jointly make final decisions, faithfully and transparently explaining the decision-making process from the holistic and local perspectives, respectively. Moreover, ProtoPFormer consistently achieves superior performance and visualization results over state-of-the-art (SOTA) prototype-based baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
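As a rough illustration of how a global view can steer local prototypes toward the foreground, the sketch below computes standard attention rollout (head-averaged attention plus identity, row-normalized, multiplied across layers) and turns the class token's row into a binary mask over image tokens. The top-k selection rule, the `keep_ratio` parameter, and the function names are assumptions made for this example; the paper's exact masking rule may differ.

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout over a list of per-layer attention maps.

    Each entry has shape (heads, tokens, tokens); token 0 is the class token.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attn in attentions:
        attn = layer_attn.mean(axis=0)                   # average over heads
        attn = attn + np.eye(num_tokens)                 # account for residual connections
        attn = attn / attn.sum(axis=-1, keepdims=True)   # re-normalize each row
        rollout = attn @ rollout                         # accumulate across layers
    return rollout

def foreground_mask(rollout, keep_ratio=0.5):
    """Binary mask over image tokens: keep those the class token attends to most.

    keep_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    cls_to_patches = rollout[0, 1:]                      # class-token row, image tokens only
    k = max(1, int(round(keep_ratio * cls_to_patches.size)))
    mask = np.zeros(cls_to_patches.shape, dtype=bool)
    mask[np.argsort(cls_to_patches)[-k:]] = True
    return mask
```

The resulting mask could be multiplied into the local branch's token-to-prototype similarity map so that background tokens cannot activate local prototypes.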

Paper Structure (28 sections, 11 equations, 12 figures, 7 tables)

Introduction
Related Work
Interpretability with CNNs
Interpretability with ViTs
Preliminaries
ProtoPFormer
Concentration on the Foreground
Concentration on Prototypical Parts
Experiments
Experimental Settings
Performance Comparison
Visualization Analysis
Ablation Study
Conclusion
Detailed Description of ProtoPNet
...and 13 more sections

Figures (12)

Figure 1: Visual comparison of prototypes on an example image among a CNN-based ProtoPNet (ResNet34; He et al., 2016), a ViT-based ProtoPNet (DeiT-Ti; Touvron et al., 2021), and our ProtoPFormer (DeiT-Ti).
Figure 2: Illustration of ProtoPFormer for image recognition interpretation. The global branch provides guidance for the local branch with the FP mask. The strategy of mutual correction and joint decision makes them contribute complementarily to final predictions, capitalizing on the built-in architectures in ViTs. The loss propagation of $\mathcal{L}_{\mathrm{CE}}$ is omitted for simplicity.
Figure 3: The reasoning process of our ProtoPFormer in classifying the species of a bird with DeiT-Ti, where $\bm{\oplus}$ denotes summation of similarity scores.
Figure 4: Visual demonstration of the two most activated local prototypes in heat maps and bounding boxes on example images (randomly chosen from the CUB and Dogs datasets) of five prototype-based baselines and ProtoPFormer with DeiT-S.
Figure 5: Heat maps of the same local prototypes on different examples from the training and test sets with DeiT-S.
...and 7 more figures
