Disentangling Visual Transformers: Patch-level Interpretability for Image Classification

Guillaume Jeanneret, Loïc Simon, Frédéric Jurie

TL;DR

The paper addresses the interpretability gap in Vision Transformers by introducing HiT, a Hindered Transformer that enforces patch-local information and disentangles patch contributions to enable CLS to be expressed as a linear sum of patch-level terms. By updating only the CLS token through multi-head attention while keeping image tokens processed by MLP, HiT yields intrinsic patch-level saliency without external tools, enabling CAM-like saliency maps and layer-wise attributions. Across six diverse datasets, HiT demonstrates superior interpretability (via insertion-deletion metrics) with a modest sacrifice in top-1 accuracy compared to non-interpretable ViTs, and provides thorough qualitative analyses, sanity checks, and ablations. The work highlights a practical, interpretable-by-design alternative for applications where transparency is critical, while acknowledging slower convergence and potential limitations in modeling complex spatial dependencies, pointing to future improvements in training efficiency and inter-token interactions.

Abstract

Visual transformers have achieved remarkable performance in image classification tasks, but this performance gain has come at the cost of interpretability. One of the main obstacles to the interpretation of transformers is the self-attention mechanism, which mixes visual information across the whole image in a complex way. In this paper, we propose Hindered Transformer (HiT), a novel interpretable-by-design architecture inspired by visual transformers. Our proposed architecture rethinks the design of transformers to better disentangle patch influences at the classification stage. Ultimately, HiT can be interpreted as a linear combination of patch-level information. We show that the advantages of our approach in terms of explicability come with a reasonable trade-off in performance, making it an attractive alternative for applications where interpretability is paramount.
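To make the "linear combination of patch-level information" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a heavily simplified block: single-head attention, no LayerNorm, a `tanh` per-token MLP, and a bias-free linear head. All names (`toy_hit_forward`, `w_mlp`, `w_q`, `w_k`, `w_v`, `head`) are hypothetical. Patch tokens are never mixed with each other, and only the CLS token reads from them through attention, so the final logits split exactly into one term per patch and per layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_hit_forward(patches, cls0, layers, head):
    """Toy HiT-like forward pass (simplifying assumptions, see lead-in).

    Returns the logits, the per-layer/per-patch logit contributions with
    shape (n_layers, n_patches, n_classes), and the initial CLS term.
    """
    tokens, cls = patches.copy(), cls0.copy()
    contribs = []
    for w_mlp, w_q, w_k, w_v in layers:
        # Patch tokens: per-token MLP with residual, no mixing across patches.
        tokens = tokens + np.tanh(tokens @ w_mlp)
        # CLS token: attention read over the patch tokens only.
        q, k, v = cls @ w_q, tokens @ w_k, tokens @ w_v
        attn = softmax(k @ q / np.sqrt(q.shape[0]))   # (n_patches,)
        per_patch = attn[:, None] * v                 # (n_patches, d)
        cls = cls + per_patch.sum(axis=0)             # additive CLS update
        # The head is linear, so each patch term maps to its own logit share.
        contribs.append(per_patch @ head)
    return cls @ head, np.stack(contribs), cls0 @ head

# Hypothetical sizes: a 4x4 patch grid, width 8, 3 layers, 5 classes.
rng = np.random.default_rng(0)
grid, d, n_layers, n_classes = 4, 8, 3, 5
patches = rng.normal(size=(grid * grid, d))
cls0 = rng.normal(size=d)
layers = [tuple(0.3 * rng.normal(size=(d, d)) for _ in range(4))
          for _ in range(n_layers)]
head = rng.normal(size=(d, n_classes))

logits, contribs, cls_term = toy_hit_forward(patches, cls0, layers, head)

# The logits equal the initial CLS term plus the sum of all patch terms.
reconstructed = cls_term + contribs.sum(axis=(0, 1))

# CAM-like map: sum over layers, pick the predicted class, restore the grid.
pred = int(np.argmax(logits))
saliency = contribs.sum(axis=0)[:, pred].reshape(grid, grid)
layer_saliency = contribs.sum(axis=1)[:, pred]        # one value per layer
```

In this toy setting the decomposition is exact: `reconstructed` matches `logits` up to floating-point error, `saliency` gives one value per patch, and `layer_saliency` gives one value per layer. The attention weights still depend on the CLS query, so each term is patch-local in its value vector but weighted by a coefficient computed at run time.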

Paper Structure

This paper contains 23 sections, 8 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: ViT and HiT blocks. While the ViT block mixes the patch data, HiT updates only the CLS token via the MHA and does not post-process the classification token in the MLP, so the CLS token can be unrolled at the last layer into individual contributions.
  • Figure 2: Saliency map computation using HiT. Building on the results of the MHA section and the definition of the architecture, HiT makes it possible to extract the individual contribution of each token at each layer. By summing each token's contributions across layers, we can rearrange the tokens in a spatial layout and apply the linear layer à la CAM (Zhou et al., 2016) to extract the contribution of each token.
  • Figure 3: Interpretability comparison. We tested whether HiT's saliency maps provide better information than those of ProtoPFormer, A-ViT, B-cos, attention rollout, and GradCAM. The results indicate that our method is indeed more interpretable.
  • Figure 4: Qualitative Comparison. We show the image, the saliency maps produced by HiT, and their counterparts obtained with Rollout and GradCAM. HiT tends to rely on the object's features in the image for its prediction, regardless of whether the prediction is correct.
  • Figure 5: Layer Saliency. HiT offers more than image saliency. (a) The first experiment shows that HiT computes the contribution of each layer. Unsurprisingly, the final layers contribute the most. (b) We empirically validate these findings on ImageNet with a variety of experiments. The results show that removing certain layers changes the output as expected, congruent with the layer saliency.
  • ...and 5 more figures