From Confusion to Clarity: ProtoScore -- A Framework for Evaluating Prototype-Based XAI
Helena Monke, Benjamin Sae-Chew, Benjamin Fresz, Marco F. Huber
TL;DR
ProtoScore addresses the lack of objective benchmarks for prototype-based XAI, particularly for time series data, by integrating the Co-12 properties into a unified, automated evaluation framework. It defines latent-space preliminaries, extends the Co-12 properties with prototype-specific metrics, and provides concrete formulas to quantify correctness, consistency, continuity, contrastivity, covariate complexity, compactness, confidence, input completeness, and latent-space cohesion. Through illustrative use cases and multi-dataset experiments, the framework shows how MAP and MSP prototype methods fare across these metrics, guiding practitioners in method selection while highlighting trade-offs and dataset dependencies. The framework emphasizes reproducibility and reduces reliance on costly user studies, while its quantitative metrics can later be connected to user studies for richer, human-centered validation.
Abstract
The complexity and opacity of neural networks (NNs) pose significant challenges, particularly in high-stakes fields such as healthcare, finance, and law, where understanding decision-making processes is crucial. To address these issues, the field of explainable artificial intelligence (XAI) has developed various methods aimed at clarifying AI decision-making, thereby facilitating appropriate trust and validating the fairness of outcomes. Among these methods, prototype-based explanations offer a promising approach that uses representative examples to elucidate model behavior. However, a critical gap exists regarding standardized benchmarks to objectively compare prototype-based XAI methods, especially in the context of time series data. This lack of reliable benchmarks results in subjective evaluations, hindering progress in the field. We aim to establish a robust framework, ProtoScore, for assessing prototype-based XAI methods across different data types with a focus on time series data, facilitating fair and comprehensive evaluations. By integrating the Co-12 properties of Nauta et al., this framework allows for effectively comparing prototype methods against each other and against other XAI methods, ultimately assisting practitioners in selecting appropriate explanation methods while minimizing the costs associated with user studies. All code is publicly available at https://github.com/HelenaM23/ProtoScore.
