On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification

Jonas Klotz, Tom Burgert, Begüm Demir

TL;DR

Explainable AI (xAI) methods and evaluation metrics developed for computer vision (CV) may not translate well to remote sensing (RS) image scene classification. This work systematically evaluates ten explanation metrics applied to five feature attribution methods on three RS datasets. It finds that robustness and randomization metrics are comparatively more reliable in RS, while faithfulness, localization, and complexity metrics exhibit RS-specific limitations. Grad-CAM emerges as a broadly effective attribution method across metric categories, though no single method excels on all criteria. The paper also offers practical guidelines for selecting explanation methods and metrics in RS and emphasizes the need for RS-tailored xAI methods and metrics.

Abstract

The development of explainable artificial intelligence (xAI) methods for scene classification problems has attracted great attention in remote sensing (RS). Most xAI methods and the related evaluation metrics used in RS were initially developed for natural images considered in computer vision (CV), and their direct usage in RS may not be suitable. To address this issue, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. In detail, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, randomization), applied to five established feature attribution methods (Occlusion, LIME, Grad-CAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, heavily depends on perturbation baselines and spatial characteristics of RS scenes. Gradient-based approaches like Grad-CAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. Analogously, we find limitations in evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization metrics and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness metrics and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.
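
The abstract's point that perturbation-based methods depend on the perturbation baseline can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the paper's setup: `toy_model` is a hypothetical stand-in for the class score $f_c$ (it simply returns the mean intensity), the 8x8 image is synthetic, and `occlusion_attribution` is a simplified patch-wise Occlusion, not any library's implementation.

```python
import numpy as np

def occlusion_attribution(model, image, patch=4, baseline=0.0):
    """Simplified Occlusion: attribution = score drop when a patch is replaced by `baseline`."""
    h, w = image.shape[:2]
    reference = model(image)
    attribution = np.zeros((h, w), dtype=float)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            perturbed = image.copy()
            perturbed[top:top + patch, left:left + patch] = baseline
            attribution[top:top + patch, left:left + patch] = reference - model(perturbed)
    return attribution

def toy_model(image):
    # Hypothetical stand-in for f_c: the mean intensity of the image.
    return float(image.mean())

# Synthetic scene: uniform background (0.5) with a brighter region (1.0) in the top-left.
image = np.full((8, 8), 0.5)
image[:4, :4] = 1.0

# Same model, same image, two different perturbation baselines.
attr_black = occlusion_attribution(toy_model, image, baseline=0.0)
attr_mean = occlusion_attribution(toy_model, image, baseline=float(image.mean()))

print("black baseline, background patch:", attr_black[7, 7])
print("mean baseline,  background patch:", attr_mean[7, 7])
```

In this toy case the background patch receives a positive attribution under the black baseline and a negative one under the mean baseline, so the choice of baseline alone changes how the same region is explained.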

Paper Structure

This paper contains 36 sections, 34 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: The evaluation protocol for selecting appropriate explanation methods and metrics for RS image scene classification. It involves generating explanations using various feature attribution methods, estimating their quality using selected metrics, assessing the reliability of each metric with MetaQuantus (Hedström et al., 2023), and using the most reliable metrics to select the most suitable explanation methods.
  • Figure 2: Visualizations of perturbed images (using $\alpha = 0$) from the DeepGlobe18 dataset for the target class $c$ (agricultural land), whose prediction certainty on the original image is $f_{c}(\boldsymbol{x}) = 0.9$. a) Perturbed sample with $f_{c}(\boldsymbol{\tilde{x}}) = 0.99$; b) Perturbed sample with $f_{c}(\boldsymbol{\tilde{x}}) = 0.38$.
  • Figure 3: LRP visualizations for an image from the DeepGlobe18 dataset. a) Original image with two classes: Urban Land and Agricultural Land; b) Pixel-wise reference map (blue: Urban Land, orange: Agricultural Land); c) LRP explanation for Agricultural Land (normalized range: [0,1], unnormalized range: [0,0.02]); d) LRP explanation for Urban Land (normalized range: [0,1], unnormalized range: [0,0.18]).
  • Figure 4: Comparison of the MoRF (most relevant first) and LeRF (least relevant first) removal strategies for the target class $c$ (agricultural land, orange). Here, $K$ is the percentage of removed pixels and $f_c(\boldsymbol{x})$ represents the prediction certainty. Top row: a) Original image from the DeepGlobe18 dataset; b) Pixel-wise reference map (orange: agricultural land); MoRF removal: c) $K=10\%$, $f_c(\boldsymbol{x})=0.98$; d) $K=50\%$, $f_c(\boldsymbol{x})=0.64$; e) $K=90\%$, $f_c(\boldsymbol{x})=0.31$. Bottom row: f) Grad-CAM attribution map for $c$; LeRF removal: g) $K=10\%$, $f_c(\boldsymbol{x})=0.99$; h) $K=50\%$, $f_c(\boldsymbol{x})=0.99$; i) $K=90\%$, $f_c(\boldsymbol{x})=0.72$.
  • Figure 5: Explanations for an image from the DeepGlobe18 dataset labeled as Forest and their corresponding SP metric scores. a) Original image; b) Pixel-wise reference map (green: Forest); c) Occlusion explanation ($\Psi_{\mathrm{SP}} = 0.08$); d) LRP explanation ($\Psi_{\mathrm{SP}} = 0.47$).
  • ...and 7 more figures
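
The removal strategies compared in Figure 4, MoRF (most relevant first) and LeRF (least relevant first), can be sketched as follows. This is a toy illustration under stated assumptions and not the paper's evaluation code: `remove_pixels` and `toy_model` are hypothetical names, removed pixels are replaced by a zero baseline, the image is a synthetic 4x4 array, and the attribution map is taken to equal the image so that the ranking is unambiguous.

```python
import numpy as np

def remove_pixels(image, attribution, fraction, order="morf", baseline=0.0):
    """Replace `fraction` of the pixels with `baseline`, ranked by attribution.

    order="morf" removes the most relevant pixels first,
    order="lerf" removes the least relevant pixels first.
    """
    flat_attr = attribution.ravel()
    k = int(round(fraction * flat_attr.size))
    ranking = np.argsort(flat_attr, kind="stable")  # ascending: least relevant first
    if order == "morf":
        ranking = ranking[::-1]                     # descending: most relevant first
    perturbed = image.copy().ravel()
    perturbed[ranking[:k]] = baseline
    return perturbed.reshape(image.shape)

def toy_model(image):
    # Hypothetical stand-in for the prediction certainty f_c: mean intensity.
    return float(image.mean())

image = np.arange(16, dtype=float).reshape(4, 4)
attribution = image.copy()  # assumption: brighter pixels are more relevant

for fraction in (0.1, 0.5, 0.9):
    morf_score = toy_model(remove_pixels(image, attribution, fraction, "morf"))
    lerf_score = toy_model(remove_pixels(image, attribution, fraction, "lerf"))
    print(f"K={fraction:.0%}  MoRF score={morf_score:.3f}  LeRF score={lerf_score:.3f}")
```

When the attribution map ranks pixels well, the score falls quickly under MoRF and stays high under LeRF, which is the qualitative pattern shown in Figure 4 (for example, 0.64 versus 0.99 at $K=50\%$).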