Linearizing Vision Transformer with Test-Time Training

Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang, Gao Huang

Abstract

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions.
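To make the key shift-invariance point above concrete, the sketch below recenters and rescales the keys over the token dimension before they enter the linearized layer. Softmax attention cancels any offset shared by all keys (it contributes the same logit to every position), while a linear-complexity layer such as TTT does not, so removing that shared component helps inherited projection weights behave consistently. This is a hedged illustration, not the paper's implementation: the module name `KeyInstanceNorm`, the tensor layout `(batch, heads, tokens, head_dim)`, the non-affine normalization, and the `eps` value are assumptions.

```python
import torch
import torch.nn as nn


class KeyInstanceNorm(nn.Module):
    """Hypothetical sketch of key instance normalization for linearization.

    Recentres keys per instance (over the token axis) so the shared key
    shift that Softmax attention would absorb is removed before the
    pretrained weights are reused in a linear-complexity TTT layer.
    Details (affine parameters, per-head statistics) are assumptions.
    """

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        # k: (batch, heads, tokens, head_dim); normalize over the token axis
        mu = k.mean(dim=-2, keepdim=True)
        sigma = k.std(dim=-2, keepdim=True)
        return (k - mu) / (sigma + self.eps)


if __name__ == "__main__":
    k = torch.randn(2, 8, 196, 64) + 3.0      # keys with a large shared shift
    k_aligned = KeyInstanceNorm()(k)
    print(k_aligned.mean(dim=-2).abs().max())  # ~0: the shared shift is removed
```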

Paper Structure

This paper contains 33 sections, 15 equations, 6 figures, and 12 tables.

Figures (6)

  • Figure 1: Left: 2K image generated by SD3.5-T$^5$; right: 1K images generated by SD3.5-T$^5$.
  • Figure 2: Structural similarity between Softmax Attention and two-layer TTT enables direct weight inheritance and fast adaptation.
  • Figure 3: Top: Softmax absorbs key shifts while TTT does not. Our method recenters the keys when inheriting pretrained weights. Bottom: Distribution of key shift ratio across 5K images: pretrained ViT exhibits ratio $\approx 0.5$, indicating substantial key bias, while random initialization yields ratio $\approx 0.07$.
  • Figure 4: Visualizations of implicit attention scores. Softmax attention exhibits strong local bias. While TTT yields meaningful attention distributions, it focuses more on global modeling. DWC$_{QK}$ enhances locality.
  • Figure 5: FLOPs versus resolution on DeiT (left) and DiT (right). The efficiency advantage of $\mathrm{T}^5$ becomes more pronounced as the sequence length grows.
  • ...and 1 more figure