Interpretable Vision Transformers in Image Classification via SVDA

Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos

TL;DR

The paper tackles the interpretability gap in Vision Transformers by adapting SVDA, the authors' previously proposed SVD-inspired attention mechanism, which decouples directional information from spectral importance through soft-orthonormal projections and a learned diagonal matrix $\Sigma$ in the attention computation $A \sim Q \Sigma K^\top$. By integrating SVDA into ViTs while preserving architectural compatibility, the authors demonstrate that attention becomes more structured and sparse without sacrificing accuracy on CIFAR-10, CIFAR-100, FashionMNIST, and ImageNet-100. They also apply six interpretability indicators (spectral entropy, effective rank, spectral sparsity, angular alignment, selectivity index, and perturbation robustness), originally proposed with SVDA, to diagnose attention dynamics at the head and layer level throughout training. The results show that SVDA maintains competitive performance while providing richer, geometry-grounded interpretability, laying a foundation for explainable AI, spectral diagnostics, and potential attention-regularization strategies in vision models.
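To make the form $A \sim Q \Sigma K^\top$ concrete, here is a minimal NumPy sketch of a single attention head. It is an illustration under stated assumptions, not the authors' implementation: the function name `svda_attention`, the weight names `W_q` and `W_k`, and the vector `sigma` are hypothetical, and row-wise unit normalization is used only as a simple stand-in for the soft-orthonormal projections, whose exact constraint is not given in this summary.

```python
import numpy as np

def svda_attention(X, W_q, W_k, sigma):
    """Toy single-head attention of the form softmax(Q diag(sigma) K^T).

    Assumption: rows of Q and K are unit-normalized so they carry only
    direction, while `sigma` carries per-dimension spectral importance.
    """
    Q = X @ W_q
    K = X @ W_k
    # Keep direction only (stand-in for the soft-orthonormal projections).
    Q = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    K = K / np.linalg.norm(K, axis=-1, keepdims=True)
    # Scaling the columns of Q by sigma equals Q @ diag(sigma).
    logits = (Q * sigma) @ K.T
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))      # token embeddings
W_q = rng.normal(size=(d_model, d_head))      # query projection
W_k = rng.normal(size=(d_model, d_head))      # key projection
sigma = np.abs(rng.normal(size=d_head))       # diagonal of Sigma
A = svda_attention(X, W_q, W_k, sigma)
```

In this sketch each row of `A` is a probability distribution over tokens, and driving an entry of `sigma` toward zero removes the corresponding direction from the attention logits, which is the sense in which the diagonal acts as a spectrum.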

Abstract

Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, unstructured behavior. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We use the interpretability indicators -- originally proposed with SVDA -- to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.

Paper Structure (13 sections, 1 equation, 4 figures)

Figures (4)

  • Figure 1: Accuracy over epochs for baseline and SVDA models across all datasets. The two models show nearly identical learning trajectories.
  • Figure 2: Training time per epoch for baseline and SVDA models. SVDA introduces an average overhead of approximately 17% due to its additional spectral normalization and modulation steps.
  • Figure 3: SVDA interpretability and structural attention diagnostics across datasets and interpretability indicators (part 1 of 2): (a) Spectral Entropy evolution per epoch; (b) Spectral Entropy per layer; (c) Effective Rank evolution per epoch; (d) Effective Rank per layer; (e) Angular Alignment evolution per epoch; (f) Angular Alignment per layer.
  • Figure 4: SVDA interpretability and structural attention diagnostics across datasets and interpretability indicators (part 2 of 2): (g) Selectivity Index evolution per epoch; (h) Selectivity Index per layer; (i) Spectral Sparsity evolution per epoch; (j) Spectral Sparsity per layer; (k) Perturbation Robustness evolution per epoch; (l) Perturbation Robustness per layer.
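The spectral indicators named in the captions above can be illustrated on a learned diagonal $\Sigma$. The sketch below uses common textbook definitions, which are assumptions here since this summary does not give the paper's formulas: spectral entropy as the Shannon entropy of the normalized spectrum, effective rank as the exponential of that entropy, and spectral sparsity as the fraction of values below a relative threshold. The function names and the `rel_tol` parameter are hypothetical.

```python
import numpy as np

def spectral_entropy(sigma):
    """Shannon entropy (in nats) of the normalized magnitude spectrum."""
    s = np.abs(np.asarray(sigma, dtype=float))
    p = s / s.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

def effective_rank(sigma):
    """exp(entropy): roughly how many directions carry the spectrum."""
    return float(np.exp(spectral_entropy(sigma)))

def spectral_sparsity(sigma, rel_tol=1e-2):
    """Fraction of values below rel_tol times the largest magnitude.

    The threshold is an illustrative choice, not taken from the paper.
    """
    s = np.abs(np.asarray(sigma, dtype=float))
    return float(np.mean(s < rel_tol * s.max()))

flat = np.ones(4)                         # all directions equally important
peaked = np.array([1.0, 0.0, 0.0, 0.0])   # a single dominant direction

flat_rank = effective_rank(flat)
peaked_rank = effective_rank(peaked)
peaked_sparsity = spectral_sparsity(peaked)
```

Under these definitions a flat spectrum of length 4 has effective rank 4 and sparsity 0, while a spectrum concentrated on one direction has effective rank 1 and sparsity 0.75. Tracking such values per head and per epoch is the kind of monitoring the figures describe.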