Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers

Anthony Fuller, Yousef Yassin, Daniel G. Kyrollos, Evan Shelhamer, James R. Green

TL;DR

The paper introduces Jumbo, a wide global token that adds capacity to plain, non-hierarchical Vision Transformers without sacrificing speed. Jumbo widens a single global token to J×D, where D is the patch token width, and shares its dedicated FFN across all layers, which expands global processing while keeping the ViT interface intact. Results on ImageNet-1K/21K, MAE pre-training, robustness, and time-series tasks show consistent speed-accuracy gains and better Pareto frontiers than prior compute-efficient approaches. The approach remains compatible with SSL and with multimodal or non-2D data, offering a practical, scalable upgrade path for plain ViTs.

Abstract

ViTs are general and accurate, and address many tasks, but ViTs are slow, and are not always practical when efficiency is key. Existing methods for faster ViTs design hybrid non-ViT architectures, losing generality, or shrink their tokens, sacrificing accuracy. While many non-ViT architectures are both fast and accurate, they cannot flexibly process other input shapes, pre-train by SOTA self-supervised learning, reduce computation by dropping tokens, and more, as ViTs can. We make ViTs faster by reducing patch token width while increasing global token width by adding a new Jumbo token. Our wider Jumbo token is processed by its own wider FFN to increase model capacity. Yet our Jumbo FFN is efficient: it processes a single token, for speed, and its parameters are shared across all layers, for memory. Crucially, our Jumbo is attention-only and non-hierarchical, like a plain ViT, so it is simple, scalable, flexible, and compatible with ViT methods new and old. Jumbo improves over ViT baselines with Registers from Nano to Large scales while maintaining speed/throughput on ImageNet-1K (0.1-13%). Jumbo also improves MAE pre-training (4.9% linear probing on ImageNet-1K), test-time adaptation (5.2% on ImageNet-C), and time series modeling. Our Jumbo models even achieve better speed-accuracy trade-offs than specialized non-ViT compute-efficient models, while maintaining plain-ViT compatibility for practicality. Code and weights are available at https://github.com/antofuller/jumbo

Paper Structure

This paper contains 26 sections, 3 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Plain ViTs are in red, and others are in blue. ViT+Jumbo outperforms SOTA compute-efficient architectures while maintaining the advantages of plain ViTs. ViT+Jumbo outperforms ViT+Registers on ImageNet-1K and the more challenging ImageNet-21K dataset. Throughput is measured on an RTX 4090 GPU using PyTorch 2.6.0, torch.compile, and a batch size of $512$.
  • Figure 2: (Left) Our ViT+Jumbo method creates a wide global token that is split into several tokens, each as wide as a patch token, before multi-headed self-attention (MHSA). After attention, the split Jumbo token is reassembled via concatenation and is then processed by its own FFN. Patches are processed by their own shared FFN. (Right) ViT+Registers creates register tokens, all equal to the patch width, and all tokens are processed by a shared FFN. ViT+Jumbo enhances global processing because the (split) global tokens can interact via an expressive FFN in addition to attention.
  • Figure 3: The cost of layers is largely determined by the number of patches and their width $D$. The cost of our Jumbo token ($J{=}6$) is negligible.
  • Figure 4: ViT+Jumbo achieves the Pareto frontier and is much simpler than specialized compute-efficient architectures. Results are plotted for each model's best learning rate. Throughput is measured on an RTX 4090 GPU using PyTorch 2.6.0, torch.compile, and a batch size of $512$.
  • Figure 5: Jumbo (left two subfigures) eliminates high-norm, outlier tokens in our measurements. According to Darcet et al. (2024), outlier tokens cause attention-map artifacts, and their presence can be reduced by adding registers (right two subfigures). By inspection, Jumbo also learns artifact-free attention maps, and split Jumbo tokens seem to specialize.
  • ...and 1 more figure
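
The split, attend, reassemble, and FFN steps described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head attention as a stand-in for MHSA, omits normalization layers and the attention output projection, uses ReLU and a hidden multiplier of 4 in the FFNs, and adds residual connections as in a standard ViT block. All function names, parameter names, and the toy sizes are hypothetical. The only points taken from the text are that the Jumbo token of width J×D is split into J tokens of width D before attention, concatenated afterwards, processed by its own wider FFN, and that this FFN's parameters are shared across layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head attention, used here as a simplified stand-in for MHSA.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ffn(x, w1, w2):
    # Two-layer MLP; ReLU is an assumption made for brevity.
    return np.maximum(x @ w1, 0.0) @ w2

def jumbo_layer(patches, jumbo, p):
    n, d = patches.shape
    j = jumbo.shape[0] // d
    # Split the wide Jumbo token (J*D,) into J tokens of patch width D.
    split = jumbo.reshape(j, d)
    tokens = np.concatenate([split, patches], axis=0)  # (J + N, D)
    tokens = tokens + self_attention(tokens, p["w_q"], p["w_k"], p["w_v"])
    # Reassemble the Jumbo token by concatenating its J pieces.
    jumbo = tokens[:j].reshape(j * d)
    patches = tokens[j:]
    # The Jumbo token gets its own wide FFN; patches use a separate FFN.
    jumbo = jumbo + ffn(jumbo, p["jumbo_w1"], p["jumbo_w2"])
    patches = patches + ffn(patches, p["patch_w1"], p["patch_w2"])
    return patches, jumbo

def init_params(rng, d, j, depth, hidden_mult=4):
    def mat(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)
    # One set of Jumbo FFN weights, referenced by every layer.
    shared = {
        "jumbo_w1": mat(j * d, hidden_mult * j * d),
        "jumbo_w2": mat(hidden_mult * j * d, j * d),
    }
    layers = []
    for _ in range(depth):
        layer = {
            "w_q": mat(d, d), "w_k": mat(d, d), "w_v": mat(d, d),
            "patch_w1": mat(d, hidden_mult * d),
            "patch_w2": mat(hidden_mult * d, d),
        }
        layer.update(shared)
        layers.append(layer)
    return layers

rng = np.random.default_rng(0)
D, J, N, DEPTH = 8, 6, 16, 2  # toy sizes, chosen only for illustration
layers = init_params(rng, D, J, DEPTH)
patches = rng.standard_normal((N, D))
jumbo = rng.standard_normal(J * D)
for p in layers:
    patches, jumbo = jumbo_layer(patches, jumbo, p)
print(patches.shape, jumbo.shape)  # (16, 8) (48,)
```

In this sketch the Jumbo FFN runs on a single vector per layer regardless of the number of patches, and every layer references the same `jumbo_w1` and `jumbo_w2` arrays, which mirrors the speed and memory arguments made in the abstract.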

Theorems & Definitions (1)

  • proof