ViT-5: Vision Transformers for The Mid-2020s
Feng Wang, Sucheng Ren, Tiezheng Zhang, Predrag Neskovic, Anand Bhattad, Cihang Xie, Alan Yuille
TL;DR
This paper tackles the stagnation in Vision Transformer (ViT) design by performing a modular, component-wise modernization of the plain ViT backbone. It systematically applies advancements from recent transformer literature (LayerScale, RMSNorm, QK-Norm, 2D RoPE with absolute embeddings, and register tokens) while avoiding over-gating by not combining SwiGLU with LayerScale. The resulting ViT-5 backbone outperforms prior plain ViTs on ImageNet-1k classification (84.2% top-1 for ViT-5-Base versus 83.8% for DeiT-III-Base) and improves generative and dense-prediction capabilities, attaining 1.84 FID in SiT diffusion-based image generation (versus 2.06 with a vanilla ViT backbone) and higher ADE20K mIoU across model scales. The findings indicate that a principled, component-based upgrade path can yield substantial gains, offering a practical drop-in backbone for mid-2020s vision and multimodal systems.
Abstract
This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
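To make three of the named components concrete, the following is a minimal pure-Python sketch of RMSNorm, QK-Norm, and LayerScale composed into a toy single-head attention sub-block. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of identity query/key/value projections, the choice of RMSNorm as the QK-Norm variant, and the `eps` value are all hypothetical choices made for brevity.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: rescale a vector by its root-mean-square (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # assumption: unit gain when none is supplied
    return [g * v / rms for g, v in zip(gain, x)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def qk_norm_attention(queries, keys, values):
    """Single-head attention with each query and key normalized before the dot product.

    Assumption: RMSNorm is used as the normalizer; other QK-Norm variants exist.
    """
    dim = len(queries[0])
    q_n = [rms_norm(q) for q in queries]
    k_n = [rms_norm(k) for k in keys]
    out = []
    for q in q_n:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in k_n]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def layer_scale_residual(x, branch_out, gamma):
    """LayerScale: residual update x + gamma * branch(x) with per-channel gamma."""
    return [xi + g * bi for xi, g, bi in zip(x, gamma, branch_out)]

def toy_attention_block(tokens, gamma):
    """Pre-norm sub-block: x + gamma * Attn(RMSNorm(x)), with identity projections."""
    normed = [rms_norm(t) for t in tokens]
    attended = qk_norm_attention(normed, normed, normed)
    return [layer_scale_residual(t, a, gamma) for t, a in zip(tokens, attended)]
```

Two properties are easy to check with this sketch: because queries and keys are normalized, rescaling them leaves the attention output essentially unchanged, and setting `gamma` to zero turns the sub-block into the identity, which is why LayerScale is typically initialized with small values.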
