Comprehensive Evaluation of Prototype Neural Networks
Philipp Schlinge, Steffen Meinert, Martin Atzmueller
TL;DR
Prototype-based networks promise intrinsic interpretability but require rigorous evaluation of their explanations. The authors extend prior work with 22 metrics, including 13 novel ones, and release the QuanProto library to enable reproducible benchmarking across four diverse datasets (fine-grained, non-IID, and multi-label). Through systematic comparisons of ProtoPNet, ProtoPool, and PIPNet, the study finds that learning clearly semantic prototypes is difficult and highly dataset-dependent, though contrastive learning in PIPNet yields the strongest interpretability signals and robustness. The work provides a practical toolkit for evaluating prototypes and informs future design toward more reliable, interpretable models.
Abstract
Prototype models are an important method for explainable artificial intelligence (XAI) and interpretable machine learning. In this paper, we perform an in-depth analysis of a set of prominent prototype models, including ProtoPNet, ProtoPool, and PIPNet. For their assessment, we apply a comprehensive set of metrics: in addition to standard metrics from the literature, we propose several new metrics that complement the analysis of model interpretability. In our experiments, we apply the prototype models to a diverse set of datasets, covering fine-grained classification, non-IID settings, and multi-label classification, to further contrast their performance. Furthermore, we provide our code as an open-source library (https://github.com/uos-sis/quanproto), which facilitates simple application of the metrics themselves as well as extensibility, making it easy to add new metrics and models.
