ViT-5: Vision Transformers for The Mid-2020s

Feng Wang, Sucheng Ren, Tiezheng Zhang, Predrag Neskovic, Anand Bhattad, Cihang Xie, Alan Yuille

TL;DR

This paper tackles the stagnation in Vision Transformer (ViT) design by performing a modular, component-wise modernization of the plain ViT backbone. It systematically applies advancements from recent transformer literature—LayerScale, RMSNorm, QK-Norm, 2D RoPE with absolute embeddings, and register tokens—while avoiding over-gating by not combining SwiGLU with LayerScale. The resulting ViT-5 backbone achieves state-of-the-art performance on ImageNet-1k classification (84.2% top-1 for ViT-5-B) and improves generative and dense-prediction capabilities, notably attaining a 1.84 FID in diffusion-based image generation and higher ADE20K mIoU across model scales. The findings underscore that a principled, component-based upgrade path can yield substantial gains, offering a practical drop-in backbone for mid-2020s vision and multimodal systems.

Abstract

This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
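
To make the component list more concrete, the following NumPy sketch illustrates three of the named ingredients (RMSNorm, QK-Norm, and LayerScale) in a single-head, pre-norm attention branch. It is not the authors' implementation: the function names, the toy shapes, the LayerScale initial value of 1e-4, and the choice of a gain-free RMS normalization for QK-Norm are all assumptions made for illustration.

```python
import numpy as np

def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm: divide by the root-mean-square over the last axis (no mean-centering)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    y = x / rms
    return y if weight is None else y * weight

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_branch(x, wq, wk, wv, wo, norm_weight, layer_scale):
    """Pre-norm single-head attention with QK-Norm and LayerScale (illustrative only)."""
    h = rms_norm(x, norm_weight)           # pre-normalization of the token features
    q = rms_norm(h @ wq)                   # QK-Norm: normalize queries ...
    k = rms_norm(h @ wk)                   # ... and keys before the dot product
    v = h @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = (attn @ v) @ wo
    return x + layer_scale * out           # LayerScale: per-channel scaling of the branch

# Toy inputs; all sizes and values here are hypothetical.
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
x = rng.normal(size=(n_tokens, dim))
wq, wk, wv, wo = (rng.normal(size=(dim, dim)) for _ in range(4))
norm_weight = np.ones(dim)
layer_scale = np.full(dim, 1e-4)           # assumed small initial value
y = attention_branch(x, wq, wk, wv, wo, norm_weight, layer_scale)
```

In this sketch the normalized queries and keys each have RMS close to 1, so every raw dot product is bounded in magnitude by the head dimension, and the small LayerScale value keeps the block close to the identity at initialization.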

Paper Structure

This paper contains 29 sections, 5 equations, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Overview of ViT-5 architecture. We conduct in-depth analyses and modernize the ViT architecture by refining its components, including activation scaling, normalization, positional embeddings, registers, and bias terms.
  • Figure 2: Undesired invariance when discarding absolute positional embeddings (APE). The two images are equivalent for ViTs that use only RoPE as their positional embedding. Absolute position is needed for generic vision backbones, so ViT-5 uses both APE and RoPE as default components.
  • Figure 3: Performance at dynamic resolutions. All models are trained at $224^2$ and then tested at different input sizes.
  • Figure 4: Attention visualization of DeiT-III-L and ViT-5-L at $384 \times 384$ resolution. ViT-5 exhibits improved spatial understanding, characterized by clearer and more accurate self-attention activations. This improvement primarily arises from the combined effects of relative positional embeddings and registers.
  • Figure 5: QKNorm improves training stability. We show the epoch-wise test loss of ViT-5-Small with and without QKNorm; the model with QKNorm converges smoothly and shows no loss spikes.
  • ...and 3 more figures
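
The Figure 2 caption can be illustrated with a toy calculation. The sketch below assumes a simplified 2D RoPE in which one complex channel is rotated by the row index and another by the column index; under that assumption, attention logits depend only on relative offsets, so translating every token by the same amount leaves them unchanged. Adding a position-dependent offset (a stand-in for APE) breaks that invariance. The rotation frequency, the feature values, the linear form of the absolute offset, and the reuse of one vector as both query and key are hypothetical choices, not details from the paper.

```python
import cmath

THETA = 0.5  # hypothetical rotation frequency shared by both axes

def rope_2d(vec, pos):
    """Rotate channel 0 by the row index and channel 1 by the column index."""
    row, col = pos
    return (vec[0] * cmath.exp(1j * THETA * row),
            vec[1] * cmath.exp(1j * THETA * col))

def logits(feats, positions, use_ape):
    """Pairwise attention logits; each token's vector serves as both query and key."""
    if use_ape:
        # Hypothetical absolute embedding: a fixed offset that grows with position.
        feats = [(f[0] + 0.1 * r, f[1] + 0.1 * c)
                 for f, (r, c) in zip(feats, positions)]
    rotated = [rope_2d(f, p) for f, p in zip(feats, positions)]
    return [[sum((a * b.conjugate()).real for a, b in zip(q, k))
             for k in rotated] for q in rotated]

def max_abs_diff(m1, m2):
    """Largest element-wise gap between two logit matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))

# Four toy tokens on a 2x2 grid, then the same tokens translated by (3, 5).
feats = [(1 + 0j, 0.5 + 0.5j), (0.2 - 1j, 1 + 0j),
         (-0.5 + 0.3j, 0.7 - 0.2j), (0.9 + 0.1j, -0.4 + 0.6j)]
base = [(0, 0), (0, 1), (1, 0), (1, 1)]
shifted = [(r + 3, c + 5) for r, c in base]

rope_diff = max_abs_diff(logits(feats, base, False), logits(feats, shifted, False))
ape_diff = max_abs_diff(logits(feats, base, True), logits(feats, shifted, True))
```

Here `rope_diff` is zero up to floating-point error, while `ape_diff` is clearly nonzero, which matches the caption's point that a RoPE-only model cannot distinguish the two placements.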