A Theoretical Analysis of Self-Supervised Learning for Vision Transformers

Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang

TL;DR

This paper provides the first end-to-end theoretical convergence guarantees for self-supervised learning on vision transformers by analyzing a one-layer softmax ViT under MAE and CL objectives. It models vision data with two spatial feature types (global and local patches) controlled by an information gap $\Delta$, and derives gradient-dynamics results showing that MAE learns diverse feature-position (FP) correlations across all areas, fostering locality-aware attention, whereas CL concentrates on global feature-position correlations and collapses to a single global pattern. The analysis uses a phase-based decomposition (Phase I: decoupling global FP correlations; Phase II: growth of target local FP correlation) to characterize attention patterns and convergence, providing explicit rates and regimes. The results explain empirical observations that MAE preserves locality while CL emphasizes global structure, offering a rigorous framework for SSL with ViTs and guiding future theoretical work on spatial/temporal patterns in transformer models.

Abstract

Self-supervised learning has become a cornerstone in computer vision, primarily divided into reconstruction-based methods like masked autoencoders (MAE) and discriminative methods such as contrastive learning (CL). Recent empirical observations reveal that MAE and CL capture different types of representations: CL tends to focus on global patterns, while MAE adeptly captures both global and subtle local information simultaneously. Despite a flurry of recent empirical investigations to shed light on this difference, theoretical understanding remains limited, especially for the dominant architecture, vision transformers (ViTs). In this paper, to provide rigorous insights, we model the visual data distribution by considering two types of spatial features: dominant global features and comparatively minuscule local features, and study the impact of imbalance among these features. We analyze the training dynamics of one-layer softmax-based ViTs on both MAE and CL objectives using gradient descent. Our analysis shows that as the degree of feature imbalance varies, ViTs trained with the MAE objective effectively learn both global and local features to achieve near-optimal reconstruction, while the CL-trained ViTs favor predominantly global features, even under mild imbalance. These results provide a theoretical explanation for the distinct behaviors of MAE and CL observed in empirical studies.
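
To make the setting concrete, the following is a minimal NumPy sketch of a one-layer softmax attention model trained for masked reconstruction. It is an illustration under stated assumptions, not the paper's exact architecture (Definition 2.4): the attention scores are assumed to depend only on positional encodings through a single matrix `Q`, masked patches are assumed to be zeroed at the input, and the names `random_mask`, `mae_forward`, and `mae_loss` are hypothetical. It shows only the forward pass and the loss; the gradient-descent update analyzed in the paper is not implemented.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def random_mask(num_patches, mask_ratio, rng):
    # True marks a masked patch; positions are sampled without replacement.
    num_masked = int(round(mask_ratio * num_patches))
    masked = np.zeros(num_patches, dtype=bool)
    masked[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return masked

def mae_forward(X, pos, Q, masked):
    # Assumption: score[p, q] = pos[p] @ Q @ pos[q], i.e. attention is driven
    # by positional encodings only; masked patches contribute zero content.
    X_in = np.where(masked[:, None], 0.0, X)
    attn = softmax(pos @ Q @ pos.T, axis=1)  # (P, P), each row sums to 1
    recon = attn @ X_in                      # (P, d) reconstructed patches
    return recon, attn

def mae_loss(X, pos, Q, masked):
    # Squared reconstruction error, averaged over masked patches only.
    recon, _ = mae_forward(X, pos, Q, masked)
    diff = (recon - X)[masked]
    return 0.5 * float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy usage: P patches of dimension d, one-hot positional encodings, Q = 0.
rng = np.random.default_rng(0)
P, d = 16, 8
X = rng.normal(size=(P, d))
pos = np.eye(P)
Q = np.zeros((P, P))
masked = random_mask(P, 0.75, rng)
print(mae_loss(X, pos, Q, masked))
```

With `Q` initialized at zero, every query attends uniformly to all positions. Training would then shape the entries of `Q`, which in this sketch play the role of the position-to-position attention correlations. The mask ratio of 0.75 is an arbitrary choice for the toy example, and at least one patch must be masked for the loss to be defined.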

Paper Structure (98 sections, 63 theorems, 157 equations, 4 figures)

This paper contains 98 sections, 63 theorems, 157 equations, 4 figures.

Key Result

Theorem 4.1

Suppose the information gap $\Delta\in [-0.5,-\Omega(1)]\cup[\Omega(1),1]$. For any $0<\epsilon<1$, suppose $\operatorname{polylog}(P)\gg \log(\frac{1}{\epsilon})$. We train the ViT in Definition 2.4 by GD to minimize the reconstruction loss with $\eta\ll \operatorname{poly}(P)$. The ...

Figures (4)

  • Figure 1: Visualization of attention maps in the last layer of the ViTs for query patches from two different spatial locations, similar to those presented in Park et al. (2023). The ViTs were trained by the generative self-supervised learning approach of masked reconstruction (MAE) and by discriminative methods: DINO (Caron et al., 2021) and MoCo (Chen et al., 2021).
  • Figure 2: Illustration of our data distribution (see Definition 2.1). Each cluster $\mathcal{D}_k$ is segmented into distinct areas $\mathcal{P}_{k,j}$, with squares in the same color representing the same area $\mathcal{P}_{k,j}$. The global area $\mathcal{P}_{k,1}$ (depicted in orange) contains more patches than any local area. While we use spatially contiguous partitions for clarity in this illustration, our data model also applies to non-contiguous cases.
  • Figure 3: Attention Diversity Metric: We design a novel empirical metric, the attention diversity metric, to probe the last layer of ViTs trained by masked reconstruction (MAE), CL (MoCo), another discriminative SSL approach (DINO), and supervised learning (DeiT). Lower values of this metric signify focused attention on a similar area across different patches, reflecting a global pattern of focus. Conversely, higher values suggest that attention is dispersed, focusing on different, localized areas. The results show that the MAE model excels in capturing diverse local patterns compared to discriminative methods like CL (see the experiments section for details).
  • Figure 4: The mechanism of how the masked patch attends to other patches via attention correlations in MAE.
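
The Figure 3 caption describes the attention diversity metric only qualitatively: low when different query patches attend to a similar area, high when they attend to different areas. The sketch below is a simple proxy with that behavior, assuming the metric can be approximated by one minus the mean pairwise cosine similarity between per-query attention maps. This is an assumption made for illustration, not the paper's definition, and the function name `attention_diversity` is hypothetical.

```python
import numpy as np

def attention_diversity(attn):
    # attn: (num_queries, num_keys) array of attention weights, one row per
    # query patch. Requires at least two queries and no all-zero rows.
    normed = attn / np.linalg.norm(attn, axis=1, keepdims=True)
    sim = normed @ normed.T                  # pairwise cosine similarities
    n = attn.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]   # drop self-similarities
    return float(1.0 - off_diag.mean())

# Every query attends to the same area: a global pattern, low diversity.
global_attn = np.tile(np.array([0.7, 0.1, 0.1, 0.1]), (4, 1))

# Every query attends to a different patch: a local pattern, high diversity.
local_attn = np.eye(4)

print(attention_diversity(global_attn), attention_diversity(local_attn))
```

Under this proxy, identical attention maps score approximately 0 and attention maps with disjoint support score 1, which matches the direction described in the caption.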

Theorems & Definitions (102)

  • Definition 2.1: Data distribution $\mathcal{D}$
  • Definition 2.4: ViT architecture for MAE
  • Definition 2.5: Random masking
  • Definition 2.6: ViT architecture for CL
  • Definition 2.7: Data augmentation
  • Definition 3.1
  • Theorem 4.1: Training convergence
  • Theorem 4.2: Learning Feature-Position correlations
  • Theorem 4.4: Learning with contrastive objective
  • Lemma 5.1: FP correlations, informal
  • ...and 92 more