Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

Melkamu Abay Mersha, Jugal Kalita

TL;DR

This paper addresses the opacity of Transformer models by identifying gaps in existing explainability methods, notably final-layer bias, lack of unified local–global reasoning, and insufficient context awareness across layers. It introduces the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework, which computes Layer-wise Integrated Gradients at every Transformer block and fuses these with class-specific attention gradients through a context-aware integration and relevance-rollout mechanism to produce signed, hierarchical attributions. CA-LIG demonstrates stronger faithfulness and contextual coherence across NLP and vision tasks, validated on BERT-based, multilingual, and MAE vision Transformer models, outperforming baselines in both qualitative visualizations and quantitative metrics such as token-F1 and perturbation AUC. The approach advances Transformer interpretability by capturing the evolution of evidence across layers, bridging local token contributions and global structural dependencies, with potential for broader applicability and future extensions to decoder-based and multimodal architectures.
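As a rough illustration of the token-F1 metric mentioned above, the sketch below compares the top-k attributed tokens against a human-annotated rationale. This is one common way to define token-level F1, and the paper's exact protocol is not given in this summary. The helper names `top_k_tokens` and `token_f1`, the choice of k, and the example scores are all hypothetical.

```python
def top_k_tokens(scores, k):
    """Return the indices of the k highest-scoring tokens (assumed selection rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def token_f1(predicted, gold):
    """Token-level F1 between predicted and gold rationale token indices."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical signed attribution scores for a five-token input.
scores = [0.05, 0.60, -0.30, 0.40, 0.10]
gold_rationale = [1, 3, 4]          # hypothetical human-marked token indices

predicted = top_k_tokens(scores, k=2)
f1 = token_f1(predicted, gold_rationale)
print(predicted, round(f1, 3))
```

Here the two selected tokens are both in the gold rationale (precision 1) but one gold token is missed (recall 2/3), so the score is below 1. Ranking by signed score rather than absolute value is itself an assumption; a method with signed attributions could reasonably use either.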

Abstract

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unifying the two, and lack awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and Transformer model families, including sentiment analysis and long, multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder (MAE) vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
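For readers unfamiliar with the building block named above, the following is a minimal sketch of standard Integrated Gradients on a toy differentiable function. It is not the CA-LIG framework: it omits the layer-wise application inside Transformer blocks and the fusion with attention gradients. The toy model `f`, its analytic gradient `grad_f`, the zero baseline, and the step count are all assumptions made for illustration. A layer-wise variant would presumably apply the same path integral to the hidden states entering each block rather than to the raw input.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of Integrated Gradients.

    attribution_i = (x_i - baseline_i) * mean over alpha of
                    d f / d x_i evaluated at baseline + alpha * (x - baseline)
    """
    alphas = (np.arange(steps) + 0.5) / steps      # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        total += grad_fn(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps


# Toy "model": f(x) = tanh(w . x), with its gradient written out by hand.
w = np.array([1.0, -2.0, 0.5])


def f(x):
    return np.tanh(w @ x)


def grad_f(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w


x = np.array([0.5, 0.2, -1.0])
baseline = np.zeros_like(x)                        # assumed all-zero baseline

attr = integrated_gradients(grad_f, x, baseline, steps=200)
print("attributions:", attr)
print("sum of attributions:", attr.sum())
print("f(x) - f(baseline):", f(x) - f(baseline))
```

The attributions are signed: the first feature pushes the output up while the other two push it down, which is the kind of supportive versus opposing evidence the abstract refers to. Their sum approximately equals `f(x) - f(baseline)`, the completeness property of Integrated Gradients.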

Paper Structure (25 sections, 13 equations, 18 figures, 2 tables)

This paper contains 25 sections, 13 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Proposed architecture of the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework.
  • Figure 2: CA-LIG token-level attributions for a document labeled as the Christian class from the 20 Newsgroups dataset using BERT-large. Brighter green tokens provide stronger positive evidence, lighter green indicates weaker support, red shows negative influence, and white denotes neutral relevance.
  • Figure 3: CA-LIG token-level attributions for a document labeled as the atheist class from the 20 Newsgroups dataset using BERT-base. Brighter green tokens provide stronger positive evidence, lighter green indicates weaker support, red shows negative influence, and white denotes neutral relevance.
  • Figure 4: CA-LIG token-level attributions for a negative IMDB review using BERT-large. Brighter red indicates stronger negative evidence, green indicates positive relevance, and white denotes neutral tokens.
  • Figure 5: CA-LIG token-level attributions for an Amharic hate speech sample using the XLM-R model. Brighter red indicates stronger negative evidence, green indicates positive relevance, and white denotes neutral tokens.
  • ...and 13 more figures