What Do Self-Supervised Vision Transformers Learn?

Namuk Park; Wonjae Kim; Byeongho Heo; Taekyung Kim; Sangdoo Yun

TL;DR

This study provides a rigorous comparison between contrastive learning (CL) and masked image modeling (MIM) for Vision Transformers, showing that CL emphasizes global shape information and tends to collapse attention in later layers, while MIM preserves token-level diversity and focuses on local texture information. Through analyses of self-attention, representation transforms, and layer roles (including Fourier and information-theoretic measures), the authors identify fundamental biases: CL is shape-biased and suited for linear probing, whereas MIM is texture-biased and excels in dense prediction with larger models. The work shows that CL and MIM are complementary, and that simple hybrids can outperform either method alone across downstream tasks. It also highlights practical design insights, such as leveraging explicit decoders for MIM and exploring layer-wise, non-uniform integration of CL and MIM objectives for future SSL methods.

Abstract

We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance on downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL utilizes the low-frequency signals of the representations, whereas MIM utilizes the high-frequency signals. Since low- and high-frequency information represent shapes and textures, respectively, CL is more shape-oriented and MIM more texture-oriented. (3) CL plays a crucial role in the later layers, while MIM mainly focuses on the early layers. Based on these analyses, we find that CL and MIM can complement each other and observe that even the simplest harmonization can help leverage the advantages of both methods. The code is available at https://github.com/naver-ai/cl-vs-mim.
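
To make property (2) concrete, the following is a minimal sketch of one way to compare high- and low-frequency content in a grid of patch representations. It is not the authors' implementation (their code is at the repository linked above). The function name `relative_log_amplitude`, the (H, W, C) token layout, and the choice of comparing the highest-frequency corner against the zero-frequency component are all assumptions made for illustration, and the inputs are synthetic arrays rather than ViT features.

```python
import numpy as np

def relative_log_amplitude(tokens: np.ndarray) -> float:
    """Log amplitude at the highest spatial frequency minus that at zero frequency.

    tokens: (H, W, C) grid of patch representations (hypothetical layout).
    """
    # 2D FFT over the spatial axes, shifted so zero frequency sits at the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(tokens, axes=(0, 1)), axes=(0, 1))
    amplitude = np.abs(spectrum).mean(axis=-1)  # average over channels
    h, w = amplitude.shape
    low = amplitude[h // 2, w // 2]  # zero-frequency (DC) component
    high = amplitude[0, 0]           # highest-frequency corner after fftshift
    eps = 1e-12                      # avoids log(0) for perfectly smooth maps
    return float(np.log(high + eps) - np.log(low + eps))

# Synthetic stand-ins for feature maps on a 14 x 14 grid (196 tokens).
h = w = 14
ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
smooth = np.ones((h, w, 8))                               # low-frequency only
checker = ((-1.0) ** (ii + jj))[..., None] * np.ones(8)   # alternating-sign pattern
textured = smooth + checker                               # adds high-frequency content

smooth_score = relative_log_amplitude(smooth)
textured_score = relative_log_amplitude(textured)
```

Under these assumptions, a larger score means relatively more high-frequency signal, so the textured map scores higher than the smooth one. Applying the same measure to real layer outputs would require extracting and reshaping the ViT's spatial tokens, which this sketch does not cover.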

Paper Structure (29 sections, 19 figures, 2 tables)

Introduction
How Do Self-Attentions Behave?
CL mainly captures global relationships.
Self-attentions of CL collapse into homogeneity.
Attention collapse reduces representational diversity.
Implications of the behaviors we observed.
How Are Representations Transformed?
CL transforms all tokens in unison, while MIM does so individually.
CL exploits low-frequencies, and MIM exploits high-frequencies.
CL is shape-biased, but MIM is texture-biased.
Which Components Play an Important Role?
Later layers of CL and early layers of MIM are important.
The explicit decoder helps ViTs further leverage the advantages of MIM.
Are the Two Methods Complementary to Each Other?
Conclusion
...and 14 more sections

Figures (19)

Figure 1: Self-attentions of CL (MoCo) capture global information, but they collapse into homogeneous attention maps for all query tokens and heads. Self-attentions of MIM (SimMIM) mainly focus on local areas and similar tokens. We visualize the attention maps for two different query tokens from the first through the last layers. We omit per-head results, which are mostly consistent. Left: Self-attentions of CL capture global patterns and the shape of an object. However, all attention maps capture the same shape information regardless of the query tokens. Right: Self-attentions of MIM capture local patterns and are correlated with queries.
Figure 2: Effective receptive fields of CL are global, but those of MIM are local. This is particularly evident in the later layers.
Figure 4: CL lacks representational diversity in the later layers. We measure cosine similarities of representations in the self-attentions between the heads (left), depths (middle), and spatial coordinates (right). All of the results show that the representational similarity of later self-attentions of CL is higher than that of MIM. Increasing the number of heads or the depth of CL does not effectively improve the diversity. Left: The similarity of representations from two heads in self-attention. Middle: The similarity between representations before and after self-attentions transform them. Right: The similarities of representations at two spatial coordinates. ViT-{S, L} are trained for 100 epochs.
Figure 5: Self-attention layers of CL and MIM transform representations differently. We visualize 196 spatial representation tokens for an example validation image in a representation space. The blue ($\bullet$) and red ($\bullet$) data points denote the tokens before and after the self-attention transformation. Left: The self-attentions of CL (e.g., MoCo) translate all the tokens equally, so the distances between the tokens of an image do not increase. Middle: However, CL moves the "centers of representations" (represented by $\times$) away from each other. Therefore, the images are linearly separable. The circle ($\bullet$) and triangle ($\triangle$) data points represent tokens from different images. Right: The self-attentions of MIM (e.g., SimMIM) transform representations differently according to query tokens, thus increasing the distances between tokens. See Figure 6 for quantitative analyses.
Figure 6: CL barely changes or even decreases the distribution volume of tokens from a single image, implying that it hardly distinguishes between tokens. Instead, it significantly increases the distribution volume of images. To demonstrate these properties, we visualize singular value spectra, i.e., the singular values of the distribution of representations sorted by magnitude. The higher a singular value, the larger the volume of a distribution. The right side of this figure shows the $64^\text{th}$ and $128^\text{th}$ highest singular values by depth. Top: Singular value spectra of tokens from a single image. CL decreases the singular values of the tokens, but MIM increases them. Bottom: Singular value spectra of images. CL significantly increases the volumes occupied by images, but MIM hardly does so.
...and 14 more figures
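
The diversity measures described in the captions for Figures 4 and 6 (cosine similarity between token representations, and singular value spectra of a token distribution) can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names `mean_pairwise_cosine` and `singular_value_spectrum`, the mean-centring step, and the synthetic "collapsed" and "diverse" token sets are assumptions that stand in for real CL and MIM features.

```python
import numpy as np

def mean_pairwise_cosine(tokens: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows in tokens (N, C)."""
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(tokens)
    # Exclude the diagonal (each token compared with itself).
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))

def singular_value_spectrum(tokens: np.ndarray) -> np.ndarray:
    """Singular values of the mean-centred tokens, in descending order."""
    centred = tokens - tokens.mean(axis=0, keepdims=True)  # centring is an assumption
    return np.linalg.svd(centred, compute_uv=False)

# Synthetic stand-ins: 196 tokens with 64 channels each.
rng = np.random.default_rng(0)
shared = rng.normal(size=64)          # component common to every token
noise = rng.normal(size=(196, 64))    # token-specific component
collapsed = shared + 0.1 * noise      # tokens nearly identical
diverse = shared + 2.0 * noise        # tokens spread out

collapsed_sim = mean_pairwise_cosine(collapsed)
diverse_sim = mean_pairwise_cosine(diverse)
collapsed_spectrum = singular_value_spectrum(collapsed)
diverse_spectrum = singular_value_spectrum(diverse)
```

In this toy setting, the nearly identical tokens give a higher mean cosine similarity and a uniformly smaller singular value spectrum than the spread-out tokens, which mirrors the qualitative pattern the captions attribute to later CL layers versus MIM.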
