What do Vision Transformers Learn? A Visual Exploration

Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein

TL;DR

The paper tackles a core interpretability question for Vision Transformers (ViTs) by introducing an optimization-based feature-visualization framework tailored to them. It demonstrates that the high-dimensional feed-forward representations yield the most informative visualizations, while the attention components (keys, queries, and values) are less amenable to interpretation, and it shows that ViTs preserve spatial information in every layer except the last, where a learned token-mixing operation acts as global pooling. Through large-scale visualizations across ViT variants and comparisons between ViTs and CNNs, the authors reveal a consistent progression from textures to parts to objects, a use of background features combined with a much weaker dependence on high-frequency information than CNNs show, and strong locality of patch information even in deep layers. They further show that language supervision via CLIP induces semantic and conceptual features, including abstract category detectors, enhancing transferability and robustness. Overall, the work provides actionable insights into ViT inductive biases, informs architectural considerations, and offers a practical visualization pipeline for ongoing interpretability research.

Abstract

Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks, an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but their predictions depend far less on high-frequency information. On the other hand, both architecture types behave similarly in the way features progress from abstract patterns in early layers to concrete objects in late layers. In addition, we show that ViTs maintain spatial information in all layers except the final layer. In contrast to previous works, we show that the last layer most likely discards the spatial information and behaves as a learned global pooling operation. Finally, we conduct large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method.
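
The optimization-based visualization mentioned above can be illustrated with a minimal sketch of activation maximization: start from random noise and take gradient steps on the input so that a chosen feature responds more strongly. Everything here is an assumption for illustration and not the authors' pipeline: the "feature" is a single toy linear unit over a 1-D signal, the regularizer is a squared-difference smoothness penalty standing in for whatever image prior a real method would use, and the names `visualize_feature`, `lam`, and `lr` are hypothetical.

```python
import random

def activation(w, x):
    # Toy "feature": one linear unit. In a real ViT this would be one
    # coordinate of a feed-forward layer's hidden representation.
    return sum(wi * xi for wi, xi in zip(w, x))

def smoothness_penalty(x):
    # Squared differences between neighbouring pixels (assumed regularizer).
    return sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))

def objective(w, x, lam):
    return activation(w, x) - lam * smoothness_penalty(x)

def objective_grad(w, x, lam):
    # Gradient of the objective with respect to the input x.
    grad = list(w)
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        grad[i + 1] -= lam * 2 * d
        grad[i] += lam * 2 * d
    return grad

def visualize_feature(w, steps=200, lr=0.05, lam=0.1, seed=0):
    rng = random.Random(seed)
    x = [rng.random() for _ in w]  # random-noise starting "image" in [0, 1]
    start = objective(w, x, lam)
    for _ in range(steps):
        g = objective_grad(w, x, lam)
        # Gradient ascent step, then clip back to the valid pixel range.
        x = [min(1.0, max(0.0, xi + lr * gi)) for xi, gi in zip(x, g)]
    return x, start, objective(w, x, lam)

weights = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0]  # hypothetical feature weights
image, before, after = visualize_feature(weights)
```

In this toy setting the optimized input becomes bright where the feature's weights are positive and dark where they are negative, which is the same idea, at a much smaller scale, as optimizing an image to maximally activate a ViT feature.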

Paper Structure

This paper contains 16 sections, 4 equations, 64 figures, and 3 tables.

Figures (64)

  • Figure 1: The progression for visualized features of ViT B-32. Features from early layers capture general edges and textures. Moving into deeper layers, features evolve to capture more specialized image components and finally concrete objects.
  • Figure 2: Features from a ViT trained with CLIP that relate to the category of morbidity. Top-left image in each category: Image optimized to maximally activate a feature from layer 10. Rest: Seven of the ten ImageNet images that most activate the feature.
  • Figure 3: (a): Example feature visualization from a ViT feed forward layer. Left: Image optimized to maximally activate a feature from layer 5. Center: Corresponding maximally activating ImageNet example. Right: The image's patch-wise activation map. (b): A feature from the last layer most activated by shopping carts.
  • Figure 4: Left: Visualization of key, query, and value. The visualization fails both to extract interpretable features and to distinguish between early and deep layers. High-frequency patterns and adversarial behavior dominate. Right: ViT feed forward layer. The first linear layer increases the dimension of the feature space, and the second one brings it back to its initial dimension.
  • Figure 5: ViT feed forward layer. The first linear layer increases the dimension of the feature space, and the second one brings it back to its initial dimension.
  • ...and 59 more figures
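
The feed forward layer described in the Figure 4 and Figure 5 captions, together with the patch-wise activation map from Figure 3, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: the sizes (`d_model = 4`, `d_hidden = 16`), the 4x expansion ratio, the GELU nonlinearity, the random weights, and the names `feed_forward` and `feature_index` are all chosen here for illustration.

```python
import math
import random

def gelu(v):
    # Exact GELU, assumed here as the nonlinearity between the two linear maps.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, weight, bias):
    # weight holds one row per output unit, so len(weight) is the output size.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def feed_forward(token, w1, b1, w2, b2):
    hidden = [gelu(h) for h in linear(token, w1, b1)]  # expanded feature space
    return hidden, linear(hidden, w2, b2)              # back to token dimension

rng = random.Random(0)
d_model, d_hidden, num_patches = 4, 16, 9  # toy sizes (assumed)
w1, b1 = random_matrix(d_hidden, d_model, rng), [0.0] * d_hidden
w2, b2 = random_matrix(d_model, d_hidden, rng), [0.0] * d_model
tokens = random_matrix(num_patches, d_model, rng)  # one token per image patch

feature_index = 3  # hypothetical hidden feature to inspect
outputs = [feed_forward(t, w1, b1, w2, b2) for t in tokens]
# One activation per patch: reshaped to the patch grid, this is the kind of
# patch-wise activation map shown alongside a feature visualization.
activation_map = [hidden[feature_index] for hidden, _ in outputs]
```

The hidden vector is the wider representation between the two linear layers, which is where the summary says the most informative visualizations come from; each entry of `activation_map` records how strongly one patch drives the chosen feature.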