Finding the right XAI method -- A Guide for the Evaluation and Ranking of Explainable AI Methods in Climate Science

Philine Bommer, Marlene Kretschmer, Anna Hedström, Dilyara Bareeva, Marina M. -C. Höhne

TL;DR

The paper formulates a principled framework to evaluate explainable AI methods in climate science by benchmarking local, model-aware explainers against a random baseline across five properties: robustness, faithfulness, randomization, complexity, and localization. Using a case study that predicts the decade from annual-mean temperature maps with both an MLP and a CNN, it demonstrates how explainers such as Integrated Gradients, layer-wise relevance propagation, and input times gradient vary in suitability depending on network architecture and task. The study finds that perturbation-based methods can improve robustness in CNNs, while salience-based methods often yield higher faithfulness and lower complexity, though results depend on ROI definitions and data variability. By introducing a Quantus-based skill-score framework, the authors provide actionable guidance for climate researchers to select XAI methods tailored to their research questions, data, and model structures. The work thus advances practical, task-specific XAI evaluation in climate AI and offers a replicable benchmark for future studies.

Abstract

Explainable artificial intelligence (XAI) methods shed light on the predictions of machine learning algorithms. Several different approaches exist and have already been applied in climate science. However, ground-truth explanations are usually missing, which complicates the evaluation and comparison of these approaches and, in turn, impedes the choice of an XAI method. Therefore, in this work, we introduce XAI evaluation in the climate context and discuss different desired explanation properties, namely robustness, faithfulness, randomization, complexity, and localization. To this end, we chose previous work as a case study in which the decade of annual-mean temperature maps is predicted. After training both a multi-layer perceptron (MLP) and a convolutional neural network (CNN), multiple XAI methods are applied and their skill scores in reference to a random uniform explanation are calculated for each property. Independent of the network, we find that the XAI methods Integrated Gradients, layer-wise relevance propagation, and input times gradient exhibit considerable robustness, faithfulness, and complexity while sacrificing randomization performance. Sensitivity methods (gradient, SmoothGrad, NoiseGrad, and FusionGrad) match the robustness skill but sacrifice faithfulness and complexity for randomization skill. We find architecture-dependent performance differences regarding the robustness, complexity, and localization skills of different XAI methods, highlighting the necessity for research task-specific evaluation. Overall, our work offers an overview of different evaluation properties in the climate science context and shows how to compare and benchmark different explanation methods, assessing their suitability based on strengths and weaknesses, for the specific research problem at hand. In this way, we aim to support climate researchers in the selection of a suitable XAI method.

Paper Structure

This paper contains 36 sections, 28 equations, 15 figures, and 3 tables.

Figures (15)

  • Figure 1: Schematic of the XAI evaluation procedure. Based on an annual temperature anomaly map as input, the network predicts the respective decade (box $1$). The explanation methods (Grad - gradient, SG - SmoothGrad applied to gradient, LRP - layer-wise relevance propagation) then provide insights (i.e., "shine a light", see box $2$) into the specific network's decision. The different explanation maps (marked in orange - Grad, green - SG, and blue - LRP) highlight different areas as positively (red) and negatively (blue) contributing to the network decision. Here, XAI evaluation can "shine a light" on the explanation methods and help choose a suitable method (indicated by the first rank), since evaluation explores the explanation maps regarding their robustness, faithfulness, localization, complexity, and randomization properties.
  • Figure 2: Diagram of the concept behind the robustness property. Perturbed input images are created by adding uniform noise maps of small magnitude to the original temperature map (left part of the figure). The perturbed maps are passed to the network, resulting in an explanation map for each prediction. The explanation maps of the perturbed inputs (explanation maps with grey outlines) are then compared to (indicated by a minus sign) the explanation of the unperturbed input (explanation map with black outline). A robust XAI method is expected to produce similar explanations for the perturbed and unperturbed inputs.
  • Figure 3: Diagram of the concept behind the faithfulness property. Faithfulness assesses the impact of highly relevant pixels in the explanation map on the network decision. First, the explanation values are sorted to identify the highest relevance values (here shown in red). Next, the corresponding pixel positions in the flattened input temperature map are identified (see dotted arrows) and masked (marked in black); i.e., their value is set to a chosen masking value, such as $0$ or $1$. Both the masked and the original input maps are passed through the network and their predictions are compared. If the masking is based on a faithful explanation, the prediction of the masked input ($j$, grey) is expected to change compared to (indicated by a minus sign) the unmasked input ($i$, black), e.g., a different decade is predicted.
  • Figure 4: Diagram of the concept behind the complexity property. Complexity assesses how the evidence values are distributed across the explanation map. For this, the distribution of the relevance values from the original explanation is compared to a "random" explanation drawn from a random uniform distribution. Here, shown in a 1-D example, the evidence distribution of the explanation exhibits clear maxima and minima (see maxima in red oval), which is considered desirable and linked to increased scores. The noisy features show a uniform distribution linked to a low complexity score.
  • Figure 5: Diagram of the concept behind the localization property. First, an expected region of high relevance for the network decision, the region of interest (ROI), is defined in the input temperature map (blue box). Here, the North Atlantic is chosen, as this region has been discussed as affecting the prediction (see Labe_2021). Next, the sorted explanation values of the ROI, encompassing $k$ pixels, are compared to the $k$ highest values of the sorted explanation values across all pixels. An explanation method with strong localization should assign the highest relevance values to the ROI.
  • ...and 10 more figures
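
The concepts in Figures 2 to 5 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation and not the Quantus API: the toy linear "network", the flattened 8-pixel "map", the function names, the distance measure, and the entropy-based complexity proxy are all assumptions chosen only to make the captions concrete.

```python
import numpy as np

# Hypothetical toy setup: a linear score f(x) = weights @ x on a flattened
# 8-pixel "temperature map". Everything here is an illustrative assumption.
weights = np.array([0.0, 0.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0])
x = np.ones(8)


def model(v):
    return float(weights @ v)


def explain(v):
    # Input times gradient; for a linear score the gradient equals `weights`.
    return weights * v


def robustness_distance(explain_fn, v, noise_scale=0.01, n_perturb=10, seed=0):
    # Figure 2: add small uniform noise, re-explain, and compare each
    # perturbed explanation to the unperturbed one (smaller = more robust).
    rng = np.random.default_rng(seed)
    reference = explain_fn(v)
    distances = []
    for _ in range(n_perturb):
        noise = rng.uniform(-noise_scale, noise_scale, size=v.shape)
        distances.append(np.linalg.norm(explain_fn(v + noise) - reference))
    return float(np.mean(distances))


def faithfulness_drop(model_fn, v, attribution, k, mask_value=0.0):
    # Figure 3: mask the k most relevant pixels and measure how much
    # the model output changes relative to the unmasked input.
    top_k = np.argsort(attribution)[-k:]
    masked = v.copy()
    masked[top_k] = mask_value
    return model_fn(v) - model_fn(masked)


def localization_fraction(attribution, roi_mask):
    # Figure 5: with k pixels in the ROI, report the share of the k
    # highest-relevance pixels that fall inside the ROI.
    k = int(roi_mask.sum())
    top_k = np.argsort(attribution)[-k:]
    return float(roi_mask[top_k].mean())


def entropy(attribution):
    # Figure 4 (proxy): Shannon entropy of normalized absolute relevance.
    # Concentrated explanations have lower entropy than uniform-like ones.
    p = np.abs(attribution)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


attribution = explain(x)
roi_mask = np.array([False, False, True, True, False, False, False, False])

rob = robustness_distance(explain, x)
drop = faithfulness_drop(model, x, attribution, k=2)
loc = localization_fraction(attribution, roi_mask)
ent_method = entropy(attribution)
# Random uniform explanation used as the reference, as in the captions.
ent_random = entropy(np.random.default_rng(1).uniform(0.0, 1.0, size=8))
```

In this toy setup the two relevant pixels coincide with the ROI, so the localization fraction is 1.0, and masking them removes the entire score of 5. In practice, each raw score would be compared against the score of the random uniform explanation to obtain a skill score per property; the exact skill-score definition is given in the paper and is not reproduced here.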