In Transformer We Trust? A Perspective on Transformer Architecture Failure Modes

Trishit Mondal; Ameya D. Jagtap

TL;DR

This work systematically examines the trustworthiness of transformer-based models in safety-critical applications spanning natural language processing, computer vision, and science and engineering domains, including robotics, medicine, earth sciences, materials science, fluid dynamics, nuclear science, and automated theorem proving. It highlights high-impact areas where these architectures are central and analyzes the risks associated with their deployment.

Abstract

Transformer architectures have revolutionized machine learning across a wide range of domains, from natural language processing to scientific computing. However, their growing deployment in high-stakes applications, such as computer vision, natural language processing, healthcare, autonomous systems, and critical areas of scientific computing including climate modeling, materials discovery, drug discovery, nuclear science, and robotics, necessitates a deeper and more rigorous understanding of their trustworthiness. In this work, we critically examine the foundational question: \textit{How trustworthy are transformer models?} We evaluate their reliability through a comprehensive review of interpretability, explainability, robustness against adversarial attacks, fairness, and privacy. We systematically examine the trustworthiness of transformer-based models in safety-critical applications spanning natural language processing, computer vision, and science and engineering domains, including robotics, medicine, earth sciences, materials science, fluid dynamics, nuclear science, and automated theorem proving, highlighting high-impact areas where these architectures are central and analyzing the risks associated with their deployment. By synthesizing insights across these diverse areas, we identify recurring structural vulnerabilities, domain-specific risks, and open research challenges that limit the reliable deployment of transformers.

Paper Structure (28 sections, 34 figures)

Introduction
Interpretability and Explainability in Transformers
Attention Visualization
Symbolic and Neuro-Symbolic Transformers
Robustness of Transformer
The Fragility of Learned Representations
Architectural Robustness: Vision Transformers vs. CNNs
Enhancing Reliability: Training, Architecture, and Verification
Fairness and Bias Mitigation
Amplification of Systemic Bias
Bias Mitigation Strategies
Privacy-Preserving Architectures
Trustworthy Transformers for Scientific Machine Learning
Physics-Informed Transformers for Scientific Computing
Uncertainty Quantification in Transformer
...and 13 more sections

Figures (34)

Figure 1: Key dimensions of trustworthy transformer architectures.
Figure 2: Visualization of BERT’s neuron activations for layer 0, head 0. Positive and negative values are represented using blue and orange hues, respectively, with color intensity indicating the magnitude. Similar to the attention-head view, the connecting lines reflect attention strength between words, with line thickness proportional to the attention weights. Adapted from Vig [vig2019multiscale].
Figure 3: Attribution maps from attention rollout (top) and attention flow (bottom) across six Transformer layers for three example sentences. Both methods highlight influential tokens (e.g., subject vs. distractors) more effectively than raw attention. Rollout produces sharper focus, while flow reveals broader influence patterns, aiding interpretability in deeper layers. Adapted from Abnar et al. [abnar2020quantifying].
Figure 4: Comparison of interpretability methods on visual tasks. The top row displays the input image containing multiple objects. Subsequent rows show attribution maps generated by Raw Attention, Rollout, GradCAM, and Chefer et al.'s Relevance Propagation (labeled "Ours" in the original figure). Note that Chefer et al.'s method (bottom) successfully isolates class-specific features (e.g., distinguishing between the two dogs) with significantly less noise than heuristic methods like Rollout or GradCAM. Adapted from Chefer et al. [chefer2021transformer].
Figure 5: Visual comparison of attribution maps generated by IA-ViT, Attention Rollout, and AttGrad methods. IA-ViT consistently highlights the most relevant image regions (e.g., the face of a cat or hair of a person) with high precision and minimal noise. In contrast, Rollout often produces diffuse attributions covering irrelevant background areas, while AttGrad occasionally emphasizes uninformative regions. Adapted from Qiang et al. [qiang2023interpretability].
...and 29 more figures
