Interpretable Image Classification via Non-parametric Part Prototype Learning

Zhijie Zhu, Lei Fan, Maurice Pagnucco, Yang Song

TL;DR

This work addresses a limitation of ProtoPNets: their part explanations are often repetitive, which reduces interpretability. It introduces non-parametric part-prototypes that are learned per class by clustering backbone features, and validates them on fine-grained datasets. Training proceeds in two stages on foundation Vision Transformers (e.g., ViT backbones pre-trained with self-distillation), followed by a prototype-anchored fine-tuning strategy that includes a Patch-Prototype Distance Contrastive loss and Block Expansion for efficient feature space grounding. Key contributions include a robust, diverse set of part-prototypes, an optimal-transport-based assignment with entropic regularization, and two new metrics, Distinctiveness and Comprehensiveness, which quantify explanation diversity and foreground coverage respectively. Empirically, the method achieves competitive classification accuracy while delivering richer, more holistic explanations.

Abstract

Classifying images with an interpretable decision-making process is a long-standing problem in computer vision. In recent years, Prototypical Part Networks (ProtoPNets) have gained traction as an approach for self-explainable neural networks, due to their ability to mimic human visual reasoning by providing explanations based on prototypical object parts. However, the quality of the explanations generated by these methods leaves room for improvement, as the prototypes usually focus on repetitive and redundant concepts. Leveraging recent advances in prototype learning, we present a framework for part-based interpretable image classification that learns a set of semantically distinctive object parts for each class, and provides diverse and comprehensive explanations. The core of our method is to learn the part-prototypes in a non-parametric fashion, by clustering deep features extracted from foundation vision models that encode robust semantic information. To quantitatively evaluate the quality of explanations provided by ProtoPNets, we introduce the Distinctiveness Score and the Comprehensiveness Score. Through evaluation on the CUB-200-2011, Stanford Cars and Stanford Dogs datasets, we show that our framework compares favourably against existing ProtoPNets while achieving better interpretability. Code is available at: https://github.com/zijizhu/proto-non-param.
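The non-parametric prototype update described above (cluster the patch features of one class, then set each prototype to the mean of its cluster) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes cosine similarity on L2-normalised features, uniform marginals in the entropy-regularised optimal-transport step (Sinkhorn iterations), and a hard argmax assignment before averaging. The function names `sinkhorn_assign` and `update_prototypes` and the parameters `eps` and `n_iters` are hypothetical.

```python
import numpy as np


def sinkhorn_assign(scores, eps=0.05, n_iters=50):
    """Entropy-regularised OT assignment with uniform marginals (assumed).

    scores: (N, K) patch-to-prototype similarities.
    Returns an (N, K) soft assignment whose rows sum to 1.
    """
    n, k = scores.shape
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / eps)
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True) * k  # each column sums to 1/K
        q /= q.sum(axis=1, keepdims=True) * n  # each row sums to 1/N
    return q * n  # rescale so each row sums to 1


def update_prototypes(patches, prototypes, eps=0.05, n_iters=50):
    """One non-parametric update for the prototypes of a single class.

    patches:    (N, D) patch features from that class's training images.
    prototypes: (K, D) current part-prototypes of that class.
    Returns (new_prototypes, hard_labels).
    """
    patches = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    prototypes = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    q = sinkhorn_assign(patches @ prototypes.T, eps, n_iters)
    labels = q.argmax(axis=1)  # hard assignment (an assumption of this sketch)
    new = prototypes.copy()
    for j in range(len(prototypes)):
        members = patches[labels == j]
        if len(members):  # keep the old prototype if its cluster is empty
            new[j] = members.mean(axis=0)  # empirical mean of the cluster
    new /= np.linalg.norm(new, axis=1, keepdims=True) + 1e-12
    return new, labels
```

The balanced marginals discourage all patches from collapsing onto a single prototype, which is one plausible way to obtain the diverse parts the abstract describes. Any foreground selection or momentum the actual method applies is omitted here.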

Paper Structure

This paper contains 14 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: a. Existing methods [wang_interpretable_2021, huang_evaluation_2023] encode various assumptions as regularizations to guide prototype learning, but often fail to diversify the concepts learned by each prototype. Here, multiple part-prototypes attend to the same image region, which limits their interpretability. b. In contrast, our method partitions the feature space into semantically distinct clusters and updates each prototype with the empirical mean of its respective cluster, enabling each prototype to learn a semantically different concept.
  • Figure 2: The architecture of our learning framework. a. Similar to prototype-based classification, we define $K$ part-prototypes for each category and compare them to the latent features, which yields similarity maps. Each similarity map is pooled to generate a prototype presence vector for each class. The final logits are computed as a weighted average of the presence vector. b. Non-parametric Prototype Update: We employ feature clustering to discover class-wise non-parametric part-prototypes from latent feature patches. Each part-prototype can be regarded as the empirical mean of its corresponding part over all training examples. c. The feature space of the backbone is fine-tuned efficiently by inserting new ViT blocks while the part-prototypes are kept fixed.
  • Figure 3: a. The Distinctiveness score for one sample is calculated based on the amount of overlap between the areas attended by the prototypes. b. The Comprehensiveness score is computed by comparing the ground truth foreground mask $M_x$ with the union of the thresholded image regions attended by the prototypes.
  • Figure 4: Distinctiveness (%) and Comprehensiveness (%) across various ProtoPNets, evaluated on different box sizes and thresholds. All methods besides [huang_evaluation_2023] are trained with DINOv2 ViT-B.
  • Figure 5: Visualization of the same sample across various ProtoPNet architectures. All models are trained using the same DINOv2 ViT-B backbone with $K=5$ unless annotated.
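The two metrics in the Figure 3 caption can be made concrete with a small NumPy sketch. The captions give only the idea (overlap between attended areas, and coverage of the foreground mask $M_x$ by the union of thresholded regions), so the exact formulas below are assumptions rather than the paper's definitions. Here Distinctiveness is taken as one minus the mean pairwise IoU of fixed-size boxes centred on each prototype's peak activation, and Comprehensiveness as the fraction of foreground pixels covered by the union of min-max-normalised, thresholded similarity maps. The names `box_mask`, `distinctiveness` and `comprehensiveness`, and the defaults `box_size=3` and `threshold=0.5`, are hypothetical.

```python
from itertools import combinations

import numpy as np


def box_mask(sim_map, box_size):
    """Boolean mask of a square box centred on the peak of one similarity map."""
    h, w = sim_map.shape
    r, c = np.unravel_index(np.argmax(sim_map), sim_map.shape)
    half = box_size // 2
    mask = np.zeros((h, w), dtype=bool)
    # Clip the box at the image border.
    mask[max(r - half, 0):min(r + half + 1, h),
         max(c - half, 0):min(c + half + 1, w)] = True
    return mask


def distinctiveness(sim_maps, box_size=3):
    """Assumed formula: 1 - mean pairwise IoU of the K prototype boxes.

    sim_maps: (K, H, W) array with one similarity map per prototype.
    """
    boxes = [box_mask(m, box_size) for m in sim_maps]
    ious = [(a & b).sum() / (a | b).sum() for a, b in combinations(boxes, 2)]
    if not ious:  # a single prototype cannot overlap with anything
        return 1.0
    return 1.0 - float(np.mean(ious))


def comprehensiveness(sim_maps, fg_mask, threshold=0.5):
    """Assumed formula: share of foreground pixels covered by the union
    of the thresholded (min-max normalised) similarity maps.

    fg_mask: (H, W) boolean ground-truth foreground mask (M_x in Figure 3).
    """
    lo = sim_maps.min(axis=(1, 2), keepdims=True)
    hi = sim_maps.max(axis=(1, 2), keepdims=True)
    normed = (sim_maps - lo) / (hi - lo + 1e-8)
    union = (normed >= threshold).any(axis=0)
    return float((union & fg_mask).sum() / max(fg_mask.sum(), 1))
```

Under these assumptions, prototypes that all peak at the same location score 0 on Distinctiveness, while prototypes whose boxes never overlap score 1. Sweeping `box_size` and `threshold` corresponds to the evaluation settings mentioned in the Figure 4 caption.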