Comprehensive Evaluation of Prototype Neural Networks

Philipp Schlinge, Steffen Meinert, Martin Atzmueller

TL;DR

Prototype-based networks promise intrinsic interpretability but require rigorous evaluation of their explanations. The authors extend prior work with 22 metrics, including 13 novel ones, and release the QuanProto library to enable reproducible benchmarking across four diverse datasets (fine-grained, non-IID, and multi-label). Through systematic comparisons of ProtoPNet, ProtoPool, and PIPNet, the study finds that learning clearly semantic prototypes is difficult and highly dataset-dependent, though contrastive learning in PIPNet yields the strongest interpretability signals and robustness. The work provides a practical toolkit for evaluating prototypes and informs future design toward more reliable, interpretable models.

Abstract

Prototype models are an important method for explainable artificial intelligence (XAI) and interpretable machine learning. In this paper, we perform an in-depth analysis of a set of prominent prototype models, including ProtoPNet, ProtoPool, and PIPNet. For their assessment, we apply a comprehensive set of metrics. In addition to standard metrics from the literature, we propose several new metrics to complement the analysis of model interpretability. In our experiments, we apply the prototype models to a diverse set of datasets covering fine-grained classification, non-IID settings, and multi-label classification to further contrast their performance. Furthermore, we provide our code as an open-source library (https://github.com/uos-sis/quanproto), which facilitates simple application of the metrics themselves as well as extensibility, providing the option to easily add new metrics and models.

Paper Structure

This paper contains 32 sections, 25 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: A visual representation of a prototype-based neural network. Starting from (A) a deep neural network, such as ResNet, that extracts a feature map (B), the model identifies the presence of prototypes (C) by comparing them with feature vectors, generating similarity maps (D). Classification (F) is performed according to the highest similarity scores (E) derived from these maps. Prototypes are visualized (G) using the Prototypical Relevance Propagation (PRP) method. Note cards list the metrics that focus on each specific component. Some metrics are used in multiple evaluation techniques, which is later indicated by a subscript; see the paper's metrics table.
  • Figure 2: The output-completeness of the visualization method is assessed by measuring the change in different parts of the model when the image is perturbed based on a prototype's visualization. The continuity of a model is assessed by measuring the effect of augmentations on its prototypes.
  • Figure 3: Contrastivity evaluation results. The results are averaged over 4 runs with standard deviation. Training and validation subsets were created using 4-fold stratified cross-validation.
  • Figure 4: Continuity evaluation results. The results are averaged over 4 runs with standard deviation. Training and validation subsets were created using 4-fold stratified cross-validation.
  • Figure 5: Output-Completeness (left) and Complexity (right) evaluation results on the CUB-200-2011 dataset. The results are averaged over 4 runs with standard deviation. Training and validation subsets were created using 4-fold stratified cross-validation.
  • ...and 2 more figures
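
The pipeline described in the Figure 1 caption (feature map, prototype comparison, similarity maps, highest similarity scores, classification) can be sketched in a few lines. This is a minimal illustration in plain Python, not the QuanProto implementation: the function names, the toy feature map, the prototype vectors, and the class weights are all hypothetical. It assumes a ProtoPNet-style log similarity over squared Euclidean distance; other models in the comparison may use different similarity functions and pooling.

```python
import math

def similarity_maps(feature_map, prototypes, eps=1e-4):
    """Compare every prototype with every feature vector (steps C and D).

    feature_map: H x W grid of D-dimensional feature vectors (nested lists).
    prototypes:  list of D-dimensional prototype vectors.
    Returns one H x W similarity map per prototype.
    Assumption: ProtoPNet-style similarity log((d^2 + 1) / (d^2 + eps)).
    """
    maps = []
    for proto in prototypes:
        sim_map = []
        for row in feature_map:
            sim_row = []
            for vec in row:
                d2 = sum((v - p) ** 2 for v, p in zip(vec, proto))
                sim_row.append(math.log((d2 + 1.0) / (d2 + eps)))
            sim_map.append(sim_row)
        maps.append(sim_map)
    return maps

def classify(feature_map, prototypes, class_weights):
    """Max-pool each similarity map (E), then apply a linear layer (F)."""
    maps = similarity_maps(feature_map, prototypes)
    # Highest similarity score per prototype over all spatial positions.
    scores = [max(max(row) for row in m) for m in maps]
    # One logit per class: weighted sum of the prototype scores.
    logits = [sum(w * s for w, s in zip(ws, scores)) for ws in class_weights]
    predicted = max(range(len(logits)), key=logits.__getitem__)
    return scores, logits, predicted

# Hypothetical toy input: a 2 x 2 feature map with 2-dimensional features.
feature_map = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
prototypes = [[1.0, 1.0], [5.0, 5.0]]     # prototype 0 matches one patch exactly
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # class k relies only on prototype k

scores, logits, predicted = classify(feature_map, prototypes, class_weights)
print(scores, logits, predicted)
```

Because prototype 0 coincides with the feature vector at the bottom-right position, its pooled score dominates and class 0 is predicted. Many of the evaluated metrics operate on exactly these intermediate quantities: the similarity maps, the pooled scores, and the classification weights.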