Wavelet-Based Image Tokenizer for Vision Transformers
Zhenhai Zhu, Radu Soricut
TL;DR
The paper addresses the inefficiency of the standard patch-convolution image tokenizer in Vision Transformers by introducing a wavelet-based tokenizer. It leverages pixel-space embeddings from a multilevel wavelet transform and a block-sparse projection to produce compact semantic tokens, enabling efficient high-resolution processing. Theoretical analysis clarifies why throughput improves and how the approach naturally resists adversarial perturbations, while experiments on ImageNet show higher top-1 accuracy and increased training throughput with reduced parameter counts. The work suggests promising future directions, including non-uniform token grids and scaling to even higher-resolution imagery in ViT-based models.
Abstract
Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision Transformer (ViT) models. Although many ViT variants have been proposed to improve efficiency and accuracy, little research on improving the image tokenizer itself has been reported in the literature. In this paper, we propose a new image tokenizer based on the wavelet transform. We show that ViT models with the new tokenizer achieve both higher training throughput and better top-1 accuracy on the ImageNet validation set. We present a theoretical analysis of why the proposed tokenizer improves training throughput without any change to the ViT model architecture. Our analysis suggests that the new tokenizer can effectively handle high-resolution images and is naturally resistant to adversarial attacks. Furthermore, the proposed image tokenizer offers a fresh perspective on important new research directions for ViT-based model design, such as image tokens on a non-uniform grid for image understanding.
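
As a rough illustration of the multilevel wavelet transform mentioned in the summary above, the following sketch applies an orthonormal 2D Haar transform in plain Python. The choice of the Haar wavelet, the function names `haar_step`, `haar_multilevel`, and `energy`, and the list-of-lists image format are all assumptions made for illustration. The wavelet family actually used by the authors, their pixel-space embedding, and their block-sparse projection are not specified in this section and are not reproduced here.

```python
def haar_step(img):
    """Apply one level of an orthonormal 2D Haar transform.

    `img` is a list of equal-length rows with even height and width.
    Returns four half-resolution bands: the scaled local average (approx)
    and the horizontal, vertical, and diagonal differences.
    """
    approx, dx, dy, dxy = [], [], [], []
    for i in range(0, len(img), 2):
        rows = [[], [], [], []]
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2)  # scaled 2x2 average
            rows[1].append((a - b + c - d) / 2)  # left column minus right column
            rows[2].append((a + b - c - d) / 2)  # top row minus bottom row
            rows[3].append((a - b - c + d) / 2)  # diagonal difference
        for band, row in zip((approx, dx, dy, dxy), rows):
            band.append(row)
    return approx, dx, dy, dxy


def haar_multilevel(img, levels):
    """Repeat haar_step on the approximation band `levels` times.

    Height and width must be divisible by 2 ** levels.
    Returns the coarsest approximation and a list of per-level
    (dx, dy, dxy) detail bands, finest level first.
    """
    details = []
    approx = img
    for _ in range(levels):
        approx, dx, dy, dxy = haar_step(approx)
        details.append((dx, dy, dxy))
    return approx, details


def energy(band):
    """Sum of squared coefficients in a band."""
    return sum(v * v for row in band for v in row)


if __name__ == "__main__":
    # Hypothetical 4x4 test image with values 1..16.
    image = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
    approx, details = haar_multilevel(image, levels=2)
    total = energy(approx) + sum(energy(b) for lvl in details for b in lvl)
    # An orthonormal transform preserves energy, so these two should match.
    print(energy(image), total)
```

For a constant image, every detail coefficient is zero and only the single coarse coefficient remains. This illustrates, in general terms, why wavelet coefficients of smooth image regions tend to be sparse.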
