Enhancing Vision Transformer Explainability Using Artificial Astrocytes

Nicolas Echevarrieta-Catalan, Ana Ribas-Rodriguez, Francisco Cedron, Odelia Schwartz, Vanessa Aguiar-Pulido

TL;DR

Vision transformers often generate explanations that diverge from human intuition, hindering trust in automated vision systems. The authors propose ViTA, a training-free, neuroscience-inspired modification that injects artificial astrocytes into the first self-attention block to modulate activations and enhance human-aligned explanations. Across Grad-CAM and Grad-CAM++ heatmaps evaluated against the ClickMe ground truth, ViTA yields statistically significant improvements in alignment and produces more focused object-centered explanations. This biologically inspired modulation offers a general, training-free route to improve XAI for vision models and can extend to other architectures and datasets.

Abstract

Machine learning models achieve high precision, but their decision-making processes often lack explainability. Furthermore, as model complexity increases, explainability typically decreases. Existing efforts to improve explainability primarily involve developing new eXplainable artificial intelligence (XAI) techniques or incorporating explainability constraints during training. While these approaches yield specific improvements, their applicability remains limited. In this work, we propose the Vision Transformer with artificial Astrocytes (ViTA). This training-free approach is inspired by neuroscience and enhances the reasoning of a pretrained deep neural network to generate more human-aligned explanations. We evaluated our approach employing two well-known XAI techniques, Grad-CAM and Grad-CAM++, and compared it to a standard Vision Transformer (ViT). Using the ClickMe dataset, we quantified the similarity between the heatmaps produced by the XAI techniques and a (human-aligned) ground truth. Our results consistently demonstrate that incorporating artificial astrocytes enhances the alignment of model explanations with human perception, leading to statistically significant improvements across all XAI techniques and metrics utilized.

Paper Structure

This paper contains 11 sections, 9 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of the tripartite synapse. Information is transmitted along the neural pathways as the presynaptic neuron sends neurotransmitters to the postsynaptic neuron. The astrocyte surrounds the synaptic cleft (i.e., the space between neurons). Astrocytes interact and help to modulate this process in the so-called tripartite synapse.
  • Figure 2: Proposed architecture: Vision Transformer with artificial Astrocytes (ViTA). Artificial astrocytes (stars) are added to the dense layer of the multi-head attention module of the encoder block 0.
  • Figure 3: Class activation maps produced by Grad-CAM and Grad-CAM++ for ViT and ViTA. The columns show: (1) the original image, (2) the (human-aligned) ground truth, the Grad-CAM output for (3) ViT and (4) ViTA, and the Grad-CAM++ output for (5) ViT and (6) ViTA. The numerical values above the images in columns 3–6 are SSIM scores, indicating how closely the heatmap generated by each method aligns with the ground truth.
  • Figure 4: Comparison of Grad-CAM and Grad-CAM++ applied to ViT and ViTA against the ClickMe ground truth using Spearman correlation, DSC, and SSIM, with the best parameter configurations from the metric comparison table (tab:ViTA_metric_comparison_part1). Error bars represent variability.
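
The figures above score each heatmap against the ClickMe ground truth with Spearman correlation, the Dice similarity coefficient (DSC), and SSIM. The following is a minimal sketch of how two of these scores could be computed for a single heatmap, not the authors' evaluation code. The function names, the toy 4x4 arrays, and the 0.5 binarization threshold used for DSC are all assumptions made for illustration; the summary does not state how the heatmaps were binarized or normalized.

```python
import numpy as np


def average_ranks(values):
    """Rank a 1-D array (1-based), giving tied values their mean rank."""
    values = np.ravel(np.asarray(values, dtype=float))
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)                 # 1-based last rank of each tie group
    mean_rank = ends - (counts - 1) / 2.0    # mean rank within each tie group
    return mean_rank[inverse]


def spearman(heatmap, truth):
    """Spearman correlation: Pearson correlation of the pixel ranks."""
    a = average_ranks(heatmap)
    b = average_ranks(truth)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def dice(heatmap, truth, threshold=0.5):
    """DSC = 2|A ∩ B| / (|A| + |B|) on maps binarized at `threshold`.

    The threshold is a hypothetical choice, not taken from the paper.
    """
    a = np.asarray(heatmap) >= threshold
    b = np.asarray(truth) >= threshold
    total = a.sum() + b.sum()
    return float(2.0 * np.logical_and(a, b).sum() / total) if total > 0 else 1.0


# Toy example: a 2x2 "object" in the ground truth and a heatmap shifted
# one pixel to the right, so half of the highlighted pixels overlap.
truth = np.zeros((4, 4))
truth[1:3, 1:3] = 1.0
shifted = np.zeros((4, 4))
shifted[1:3, 2:4] = 1.0

dsc_shifted = dice(shifted, truth)       # 2 * 2 / (4 + 4) = 0.5
rho_identical = spearman(truth, truth)   # a map is perfectly rank-correlated with itself
```

SSIM is omitted here to keep the sketch dependency-free; a library routine such as scikit-image's `structural_similarity` is one option for that score, assuming both maps are scaled to the same value range.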