Cutting the Skip: Training Residual-Free Transformers

Yiping Ji, James Martens, Jianqiao Zheng, Ziqin Zhou, Peyman Moghadam, Xinyu Zhang, Hemanth Saratchandran, Simon Lucey

TL;DR

This work analyzes the Jacobian of a skipless transformer block to show why skips improve conditioning and that their stabilization benefits can be recovered through a principled initialization strategy. It introduces the first method that enables stable and efficient training of skipless transformers without altering the standard architecture.

Abstract

Transformers have achieved remarkable success across a wide range of applications, a feat often attributed to their scalability. Yet training them without skip (residual) connections remains notoriously difficult. While skips stabilize optimization, they also disrupt the hierarchical structure of representations, raising the long-standing question of whether transformers can be trained efficiently without them. In this work, we address this problem by analyzing the Jacobian of a skipless transformer block, showing why skips improve conditioning and revealing that their stabilization benefits can be recovered through a principled initialization strategy. Building on this insight, we introduce the first method that enables stable and efficient training of skipless transformers without altering the standard architecture. We validate our approach on Vision Transformers (ViTs) in both supervised and self-supervised settings, demonstrating that skipless ViTs trained with our initialization overcome the usual optimization barriers, learn richer hierarchical representations, and outperform strong baselines that incorporate skip connections on dense prediction benchmarks. These results show that skip connections are not a fundamental requirement for training ViTs and open new avenues for hierarchical representation learning in vision models.

Paper Structure

This paper contains 26 sections, 3 theorems, 43 equations, 6 figures, 4 tables.

Key Result

Proposition 1

Let $S_\tau(\mathbf{M}_{\ell})\in\mathbb{R}^{n\times n}$ denote the row-wise softmax with temperature $\tau>0$. (Diffuse rows) If each row of $\mathbf{M}_{\ell}$ has a small range (difference between maximum and minimum) $\Delta \ll \tau$, then $S_\tau(\mathbf{M}_{\ell})$ is close to the uniform mat with $\varepsilon(\gamma/\tau) \to 0$ as $\gamma/\tau \to \infty$. Hence $S_\tau(\mathbf{M}_{\el
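The two regimes named in the proposition title (diffuse rows versus diagonal dominance) can be illustrated numerically. The following is a minimal sketch, not the authors' code: the helper names `softmax_rows` and `max_abs_diff`, the matrix size, and the concrete values of the range `delta` and margin `gamma` are all illustrative assumptions. Reading $\gamma$ as the margin by which each diagonal entry exceeds the off-diagonal entries of its row is also an assumption, since the statement above is truncated.

```python
import math

def softmax_rows(M, tau=1.0):
    """Row-wise softmax with temperature tau > 0 (hypothetical helper)."""
    out = []
    for row in M:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp((x - m) / tau) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

n, tau = 4, 1.0
uniform = [[1.0 / n] * n for _ in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Diffuse rows: every row has range delta = 0.01, far below tau (assumed values).
delta = 0.01
diffuse = [[delta * ((i + j) % 2) for j in range(n)] for i in range(n)]

# Diagonal dominance: diagonal exceeds off-diagonals by gamma = 20, far above tau
# (assumed interpretation of gamma, see the lead-in).
gamma = 20.0
dominant = [[gamma if i == j else 0.0 for j in range(n)] for i in range(n)]

S_diffuse = softmax_rows(diffuse, tau)
S_dominant = softmax_rows(dominant, tau)

dist_uniform = max_abs_diff(S_diffuse, uniform)     # small: rows are nearly uniform
dist_identity = max_abs_diff(S_dominant, identity)  # small: matrix is nearly the identity

print(f"diffuse rows:  max |S - uniform|  = {dist_uniform:.2e}")
print(f"diag dominant: max |S - identity| = {dist_identity:.2e}")
```

In this toy setting the diffuse matrix maps to a softmax output that is nearly the rank-one uniform matrix, while the diagonally dominant matrix maps to an output that is nearly the identity, which is the contrast the proposition title points at.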

Figures (6)

  • Figure 1: Supervised training loss of ViT-Base using AdamW (Left) and SOAP (Right) optimizers.
  • Figure 2: Dense linear-probing segmentation performance of skip and skipless DINO ViT-Small models trained with AdamW and SOAP optimizers, evaluated throughout pretraining. The y-axis range is the same within each column.
  • Figure 3: DINO ViT-Small models pretrained for 300 epochs. For skipless models, we also evaluate the checkpoint at 200 epochs. Performance on object discovery tasks using TokenCut on the VOC2012 and COCO20k datasets.
  • Figure 4: Visualization of learned representations from pretrained DINO models, without cherry-picking.
  • Figure 5: Left: We choose $\alpha = 0.1, \beta = 5$ to ensure diagonal dominance. Right: We choose $\alpha=0.1, \beta=0$.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Proposition 1: Softmax conditioning: diagonal dominance vs. diffuse rows
  • Proposition 2
  • Lemma 1
  • proof
  • proof