Investigating the Impact of Model Instability on Explanations and Uncertainty

Sara Vera Marjanović, Isabelle Augenstein, Christina Lioma

TL;DR

The paper tackles the challenge of trusting explanations from language models under input perturbations by conducting a large-scale, controlled study across multiple transformer models and text datasets. It systematically injects varying perturbations and evaluates their impact on both model uncertainty (epistemic and predictive) and gradient-based explanations, using a suite of explanation methods and robust metrics. The key findings reveal that high uncertainty does not necessarily equate to poor explanation plausibility, and that the instability of saliency maps under perturbations correlates with model performance in a nuanced, model- and dataset-dependent way; Integrated Gradients often provides the most robust explanations, especially for smaller models. The work emphasizes practical implications for XAI in noisy environments and suggests directions for noise-aware training and better datapoint-level measures of explanation quality to enhance trust in AI systems.

Abstract

Explainable AI methods facilitate the understanding of model behaviour, yet small, imperceptible perturbations to inputs can vastly distort explanations. As these explanations are typically evaluated holistically, before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none have investigated an existing link between uncertainty and explanation quality. We artificially simulate epistemic uncertainty in text input by introducing noise at inference time. In this large-scale empirical study, we insert different levels of noise perturbations and measure the effect on the output of pre-trained language models and different uncertainty metrics. Realistic perturbations have minimal effect on performance and explanations, yet masking has a drastic effect. We find that high uncertainty doesn't necessarily imply low explanation plausibility; the correlation between the two metrics can be moderately positive when the model is exposed to noise during the training process. This suggests that noise-augmented models may be better at identifying salient tokens when uncertain. Furthermore, when predictive and epistemic uncertainty measures are over-confident, the robustness of a saliency map to perturbation can indicate model stability issues. Integrated Gradients shows the overall greatest robustness to perturbation, while still showing model-specific patterns in performance; however, this phenomenon is limited to smaller Transformer-based language models.
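
To make the setup in the abstract concrete, the following is a minimal sketch of the three ingredients it mentions: perturbing a fraction of input tokens, scoring predictive uncertainty, and comparing a saliency map before and after perturbation. It is not the authors' implementation. The function names, the `[MASK]` string, random token selection, Shannon entropy as the uncertainty score, and Spearman rank correlation as the robustness score are all assumptions chosen for illustration; the perturbation level is called `alpha` only to match the figure captions.

```python
import math
import random


def mask_tokens(tokens, alpha, mask_token="[MASK]", seed=0):
    """Replace a fraction `alpha` of tokens, chosen at random, with `mask_token`.

    Assumption: random selection stands in for the "Random" hierarchy only;
    human- or gradient-ranked selection is not sketched here.
    """
    rng = random.Random(seed)
    n_perturb = round(alpha * len(tokens))
    chosen = set(rng.sample(range(len(tokens)), n_perturb))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector.

    Assumption: entropy is one common predictive-uncertainty score; the
    paper's exact uncertainty metrics are not given in this section.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)


def _average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation between two equal-length score lists.

    Assumption: used here as a stand-in for "correlation to the unperturbed
    saliency map"; returns NaN when either list is constant.
    """
    ra, rb = _average_ranks(a), _average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    tokens = "the film was surprisingly good".split()
    print(mask_tokens(tokens, alpha=0.4))
    # Hypothetical class probabilities and saliency scores, for illustration only.
    print(predictive_entropy([0.9, 0.1]))
    print(spearman([0.1, 0.2, 0.05, 0.3, 0.9], [0.2, 0.1, 0.05, 0.4, 0.8]))
```

In a real experiment the probability vector and the saliency scores would come from a model and an attribution method such as Integrated Gradients; here they are hand-written placeholders so the sketch runs with the standard library alone.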

Paper Structure (40 sections, 14 figures, 6 tables)

Figures (14)

  • Figure 1: The effect of increasing text perturbation (averaged across perturbation type) on RoBERTa performance and uncertainty across three different hierarchies: (1) Random; (2) Human, following human annotation and POS tags; and (3) Gradient, following ranking of Hotflip gradients. Dotted lines show the value at $\alpha=0.0$.
  • Figure 2: The effect of increasing text perturbation on BERT accuracy, uncertainty and explanation plausibility across the different types of perturbation, averaged across perturbation hierarchy. Dotted lines show $\alpha=0.0$.
  • Figure 3: Model-level differences of the correlation to the unperturbed saliency map at low levels of perturbation. We separately show the effect on BERT, RoBERTa, ELECTRA, GPT2 and OPT.
  • Figure 4: We show the differential effect of our perturbation hierarchies across the different datasets investigated. Values are averaged over all 7 perturbation types.
  • Figure 5: We show the differential effect of increasing levels of text perturbation on model accuracy and both measures of uncertainty on the SST-2 dataset. Values are averaged over all hierarchies.
  • ...and 9 more figures