Sub-token ViT Embedding via Stochastic Resonance Transformers
Dong Lao, Yangchao Wu, Tian Yu Liu, Alex Wong, Stefano Soatto
TL;DR
The paper tackles the limited spatial granularity of Vision Transformers (ViTs), which stems from non-overlapping tokenization, by introducing the Stochastic Resonance Transformer (SRT). SRT is a training-free, test-time ensemble: it perturbs input coordinates with sub-token translations, processes the perturbed views, and aggregates their high-resolution embeddings into an $N\times M\times C$ feature field. SRT preserves the original ViT forward pass and weights, so it can be applied to any layer and task, and simple pooling can map the result back to the tokenized scale. Empirically, SRT yields consistent gains across dense and non-dense vision tasks, including semi-supervised video object segmentation (up to 4.1% for ViT-S/16), monocular depth prediction (up to 14.9% in RMSE for linear heads), and unsupervised saliency segmentation (1.8% on average). It also enables feature visualization and offers a potential path to distillation to reduce inference cost. The approach is a lightweight, scalable way to recover fine-grained spatial structure from ViT embeddings, with demonstrated applicability across diverse models and tasks and possible extensions to other perturbation groups and architectures.
Abstract
Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch. This representation trades spatial granularity for embedding dimensionality, and results in semantically rich but spatially coarsely quantized feature maps. To retrieve spatial details beneficial to fine-grained inference tasks, we propose a training-free method inspired by "stochastic resonance". Specifically, we apply sub-token spatial transformations to the input data, and aggregate the resulting ViT features after applying the inverse transformation. The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation, but grounds it on a finer-scale spatial domain, partly mitigating the coarsening effect of spatial tokenization. SRT is applicable across any layer of any ViT architecture, consistently boosting performance on several tasks including segmentation, classification, depth estimation, and others by up to 14.9% without the need for any fine-tuning.
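
The perturb, embed, invert, and aggregate loop described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The names `toy_token_embed`, `srt_features`, and `pool_to_tokens` are hypothetical, and the per-patch mean is only a stand-in for a frozen ViT. Circular shifts, nearest-neighbor upsampling, and mean aggregation are assumptions made here to keep the example short.

```python
import numpy as np

def toy_token_embed(image, patch):
    """Hypothetical stand-in for a frozen ViT: one vector per non-overlapping patch.
    The 'embedding' here is simply the per-patch mean of the input channels."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    blocks = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return blocks.mean(axis=(1, 3))  # shape (gh, gw, c)

def srt_features(image, embed_fn, patch, shifts):
    """Average token features over sub-token translations (assumed aggregation: mean).
    Returns a dense field with the same height and width as the input."""
    h, w, _ = image.shape
    assert h % patch == 0 and w % patch == 0, "sketch assumes divisible sizes"
    acc = None
    for dy, dx in shifts:
        # Perturb: translate the input by (dy, dx) pixels (circular shift, an assumption).
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        tokens = embed_fn(shifted, patch)
        # Upsample the token grid to pixel resolution (nearest neighbor).
        dense = np.repeat(np.repeat(tokens, patch, axis=0), patch, axis=1)
        # Invert the translation so features align with the original image.
        restored = np.roll(dense, shift=(-dy, -dx), axis=(0, 1))
        acc = restored if acc is None else acc + restored
    return acc / len(shifts)

def pool_to_tokens(dense, patch):
    """Average-pool the dense field back to the tokenized scale."""
    return toy_token_embed(dense, patch)

# Usage: a vertical edge at column 2, which does not align with the 4-pixel patch grid.
patch = 4
image = np.zeros((8, 8, 1))
image[:, 2:, :] = 1.0
all_shifts = [(dy, dx) for dy in range(patch) for dx in range(patch)]
baseline = srt_features(image, toy_token_embed, patch, [(0, 0)])  # plain tokenization
field = srt_features(image, toy_token_embed, patch, all_shifts)   # sub-token ensemble
pooled = pool_to_tokens(field, patch)
```

In this toy setting, the single-view baseline is constant within each patch, while the ensemble varies from pixel to pixel inside a patch, which is the finer spatial grounding the abstract refers to. With a real ViT, `embed_fn` would wrap the model's forward pass at the chosen layer, and the cost grows linearly with the number of shifts.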
