A comprehensive study on fidelity metrics for XAI
Miquel Miró-Nicolau, Antoni Jaume-i-Capó, Gabriel Moyà-Alcover
TL;DR
Problem: Fidelity metrics for XAI lack ground truth and show inconsistent results across methods. Approach: The authors introduce a ground-truth verification framework using a transparent model (decision tree) as an objective benchmark and evaluate four fidelity metrics on two synthetic 52k-image datasets (AIXI-Shape and TXUXIv3). Findings: None of the metrics reliably matched the true fidelity, and performance degraded with higher OOD content. Impact: The study advocates developing new fidelity metrics and adopting the proposed benchmark to enable reliable evaluation in XAI research.
Abstract
The use of eXplainable Artificial Intelligence (XAI) systems has introduced a set of challenges that need resolution. Herein, we focus on how to correctly select an XAI method, an open question within the field. The inherent difficulty of this task is due to the lack of a ground truth. Several authors have proposed metrics to approximate the fidelity of different XAI methods. These metrics lack verification and have concerning disagreements. In this study, we propose a novel methodology to verify fidelity metrics, using a well-known transparent model, namely a decision tree. This model allowed us to obtain explanations with perfect fidelity. Our proposal constitutes the first objective benchmark for these metrics, facilitating a comparison of existing proposals, and surpassing existing methods. We applied our benchmark to assess the existing fidelity metrics in two different experiments, each using public datasets comprising 52,000 images. The images from these datasets had a size of 128 by 128 pixels and were synthetic data that simplified the training process. All metric values indicated a lack of fidelity, with the best one showing a 30% deviation from the expected values for a perfect explanation. Our experimentation led us to conclude that the current fidelity metrics are not reliable enough to be used in real scenarios. From this finding, we deemed it necessary to develop new metrics to avoid the detected problems, and we recommend the usage of our proposal as a benchmark within the scientific community to address these limitations.
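The idea of using a transparent model as ground truth can be illustrated with a minimal sketch. It is not the authors' implementation: the hand-written three-feature tree, the function names (`tree_predict`, `ground_truth_explanation`, `deletion_changes_prediction`), and the zero baseline are all hypothetical, chosen only to show why a decision tree yields an explanation whose fidelity is known by construction, and how a perturbation-style check could then be compared against it.

```python
# Illustrative sketch only: a hand-written decision tree stands in for the
# transparent model. None of these names or values come from the paper.

def tree_predict(x):
    """Tiny decision tree over a 3-feature input.

    Returns the predicted label and the indices of the features that were
    actually tested on the decision path.
    """
    used = [0]                      # the root always tests feature 0
    if x[0] > 0.5:
        used.append(1)              # this branch tests feature 1
        label = 1 if x[1] > 0.5 else 0
    else:
        used.append(2)              # this branch tests feature 2
        label = 1 if x[2] > 0.5 else 0
    return label, used


def ground_truth_explanation(x):
    """Binary attribution: 1 for features on the decision path, else 0.

    Because the model is transparent, this explanation is exact by
    construction, which is what makes it usable as a reference.
    """
    _, used = tree_predict(x)
    return [1 if i in used else 0 for i in range(len(x))]


def deletion_changes_prediction(x, attribution, baseline=0.0):
    """Toy perturbation check (an assumption, not one of the paper's metrics).

    Replaces every attributed feature with a baseline value and reports
    whether the prediction changes.
    """
    original, _ = tree_predict(x)
    perturbed = [baseline if a else v for v, a in zip(x, attribution)]
    changed, _ = tree_predict(perturbed)
    return changed != original


x = [0.9, 0.8, 0.1]
explanation = ground_truth_explanation(x)
print(explanation)
print(deletion_changes_prediction(x, explanation))
print(deletion_changes_prediction(x, [0, 0, 1]))
```

In this toy setting, removing the features named by the exact explanation flips the prediction, while removing an unused feature does not. A fidelity metric evaluated against such a reference should score the exact explanation as fully faithful, and the abstract reports that existing metrics fall short of this.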
