An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen

TL;DR

This work questions the necessity of locality as an inductive bias in vision models by applying Transformers directly to individual pixels as tokens, with learnable position embeddings and no 2D grid priors. Across three case studies (supervised classification and regression, self-supervised MAE pretraining, and diffusion-based image generation), the pixel-token Transformer performs competitively with or better than patch-based ViT baselines, despite much longer sequences (up to L = H · W tokens). The study also analyzes the two locality designs in ViT, patchification and position embeddings, and shows that patchification exerts the stronger locality bias. Removing locality is feasible but costly in practice because of the computation that long sequences require. The findings argue for rethinking inductive biases in vision architectures: locality is not strictly fundamental, though patch-based methods remain an effective trade-off between accuracy and efficiency. Overall, the work broadens the design space for future vision models by validating locality-free Transformers as a viable research direction, given scalable computation, across diverse tasks, including generative modeling in a latent token space.
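To make the sequence-length cost concrete, the relation between patch size and token count can be written out. This is a rough sketch: the symbol $p$ (patch side length) and the 224$\times$224 input are illustrative assumptions, not settings taken from the paper, and only the quadratic self-attention term is counted.

```latex
% Token count for an H x W image split into non-overlapping p x p patches
L(p) = \frac{H}{p} \cdot \frac{W}{p}, \qquad L(1) = H \cdot W

% Hypothetical example with H = W = 224
L(16) = 14 \cdot 14 = 196, \qquad L(1) = 224 \cdot 224 = 50176

% Self-attention compares every token pair, so its cost grows as L^2
\frac{L(1)}{L(16)} = 256, \qquad \frac{L(1)^2}{L(16)^2} = 256^2 = 65536
```

Under these assumptions, moving from 16$\times$16 patches to single pixels multiplies the token count by 256 and the pairwise attention cost by 65536, which is why pixel tokens are practical only at small input sizes or with substantial compute.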

Abstract

This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias of locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We showcase the effectiveness of pixels-as-tokens across three well-studied computer vision tasks: supervised learning for classification and regression, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although it's computationally less practical to directly operate on individual pixels, we believe the community must be made aware of this surprising piece of knowledge when devising the next generation of neural network architectures for computer vision.
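The difference between patch tokens and pixel tokens can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `tokenize` and `embed`, the 32$\times$32 toy image, and the embedding width of 64 are all hypothetical, and the position embeddings are only randomly initialized here (in a real model they would be trainable parameters).

```python
import numpy as np


def tokenize(image, patch_size):
    """Split an (H, W, C) image into non-overlapping patch tokens.

    Returns an array of shape (L, patch_size * patch_size * C),
    where L = (H / patch_size) * (W / patch_size).
    """
    h, w, c = image.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("image height and width must be divisible by patch_size")
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (L, p*p*C)
    grid = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape((h // p) * (w // p), p * p * c)


def embed(tokens, projection, position_embedding):
    """Linearly project tokens and add one position vector per token."""
    return tokens @ projection + position_embedding


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # hypothetical toy input
d_model = 64                     # hypothetical embedding width

pixel_tokens = tokenize(image, patch_size=1)   # one token per pixel: (1024, 3)
patch_tokens = tokenize(image, patch_size=16)  # ViT-style patches: (4, 768)

# Position embeddings are drawn at random, with no 2D grid prior.
# In a real model these arrays would be learned during training.
pixel_projection = rng.normal(scale=0.02, size=(pixel_tokens.shape[1], d_model))
pixel_positions = rng.normal(scale=0.02, size=(pixel_tokens.shape[0], d_model))
pixel_embedded = embed(pixel_tokens, pixel_projection, pixel_positions)  # (1024, 64)
```

With `patch_size=1` each token carries only the C channel values of a single pixel, so any notion of neighborhood has to be learned through attention and the position embeddings rather than being built into the tokens.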

Paper Structure (45 sections, 2 equations, 9 figures, 9 tables)

This paper contains 45 sections, 2 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Transformer on pixels, or 1$\times$1 'patches', which we use to investigate the role of locality. Given an image, we simply treat it as a set of pixels. We also use randomly initialized, learnable position embeddings with no prior about the 2D image grid, thus removing the remaining locality bias of previous architectures (e.g., ViT Dosovitskiy2014) that operate on non-degenerate patches. A Transformer is applied on top, with interleaved Self-Attention and MLP blocks (only one pair is shown for clarity). We showcase the effectiveness of this locality-free architecture through three case studies, spanning both discriminative and generative tasks.
  • Figure 2: Two trends with ViT. Since our Transformer on pixels can be viewed as ViT with patch size 1$\times$1, the trends w.r.t. patch size are crucial to our finding. In (a), we vary the ViT-B patch size but keep the sequence length fixed (the last data point is locality-free), so the input size also varies. While Acc@1 remains constant at first, the input size (the amount of information) quickly becomes the dominant factor and accuracy deteriorates. In (b), on the other hand, we vary the ViT-S patch size while keeping the input size fixed. The trend is the opposite: reducing the patch size always helps, and the last point (locality-free) is the best. The juxtaposition of these two trends gives a more complete picture of the relationship between input size, patch size and sequence length.
  • Figure 3: Qualitative results for case study #3: image generation. These 256$\times$256 samples are generated from ImageNet-trained DiTs Peebles2023. For direct comparison, we fix the random seeds and the categories used to prompt the model (none of the people classes from ImageNet are used); the only difference is that (a) uses the locality-biased DiT-L/2 and (b) uses the locality-free variant (DiT-L/1). Overall, generations from the locality-free model show fine, detailed and reasonable features, similar to those from the locality-based model.
  • Figure 4: Pixel permutation for ViT. We swap pixels within a Hamming distance of $\delta$ and do this $T$ times (no distance constraint if $\delta=\infty$). Illustrated is an 8$\times$8 image divided into 2$\times$2 patches. Here we show a permutation with $T=4$ pixel swaps (denoted by double-headed arrows).
  • Figure 5: Results of pixel permutation for ViT-B on ImageNet. We vary the number of pixel swaps $T$ (left) and additionally vary the maximum distance of pixel swaps $\delta$ (right). Pixel permutations can drop accuracy by 25.2% (when 25K pairs are swapped), compared to the relatively minor drop (1.6%) when the position embedding is completely removed. When farther-away pixels are allowed to be swapped, more damage is caused. These results suggest that pixel permutation has a much more significant impact on performance than swapping position embeddings.
  • ...and 4 more figures
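
The pixel permutation described in the Figure 4 caption can be sketched in NumPy as below. This is an illustrative approximation rather than the authors' code: the function name `permute_pixels` and its parameters are hypothetical, and the distance between two pixels is assumed to be the L1 distance between their (row, column) coordinates, which is one possible reading of the caption.

```python
import numpy as np


def permute_pixels(image, num_swaps, max_dist=None, seed=0):
    """Return a copy of `image` with `num_swaps` random pixel-pair swaps.

    `max_dist` bounds the L1 distance between the coordinates of a swapped
    pair (an assumed distance measure); None means no distance constraint.
    """
    out = image.copy()
    h, w = out.shape[:2]
    if max_dist is not None and (max_dist < 1 or h * w < 2):
        raise ValueError("need max_dist >= 1 and at least two pixels")
    rng = np.random.default_rng(seed)
    for _ in range(num_swaps):
        y1, x1 = rng.integers(h), rng.integers(w)
        if max_dist is None:
            y2, x2 = rng.integers(h), rng.integers(w)
        else:
            # Rejection-sample a distinct partner inside the image
            # and within the allowed distance.
            while True:
                dy = rng.integers(-max_dist, max_dist + 1)
                dx = rng.integers(-max_dist, max_dist + 1)
                y2, x2 = y1 + dy, x1 + dx
                in_bounds = 0 <= y2 < h and 0 <= x2 < w
                if in_bounds and 0 < abs(dy) + abs(dx) <= max_dist:
                    break
        # Fancy indexing copies the right-hand side, so this is a true swap.
        out[[y1, y2], [x1, x2]] = out[[y2, y1], [x2, x1]]
    return out


# Same scale as the Figure 4 illustration: an 8x8 image and T = 4 swaps.
toy_image = np.arange(64).reshape(8, 8)
swapped = permute_pixels(toy_image, num_swaps=4, max_dist=2)
```

Because pixels are swapped in place of one another, the image keeps exactly the same set of pixel values; only their positions change, which is what breaks the locality assumption built into patchification.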