Sub-token ViT Embedding via Stochastic Resonance Transformers

Dong Lao, Yangchao Wu, Tian Yu Liu, Alex Wong, Stefano Soatto

TL;DR

The paper addresses the limited spatial granularity of Vision Transformers (ViTs), which stems from non-overlapping tokenization, by introducing the Stochastic Resonance Transformer (SRT): a training-free, test-time ensemble that perturbs input coordinates with sub-token translations, processes the perturbed views, and aggregates their high-resolution embeddings into an $N\times M\times C$ feature field. SRT leaves the original ViT forward pass and weights unchanged, so it applies to any layer and task, and simple pooling maps its output back to the token scale. Empirically, SRT yields consistent gains across dense and non-dense vision tasks, including semi-supervised video object segmentation (up to 4.1% for ViT-S/16), monocular depth prediction (up to 14.9% improvement in RMSE with linear heads), and unsupervised saliency segmentation (1.8% on average). It also enables feature visualization and suggests a path to distillation that would reduce inference cost. The approach is a lightweight way to recover fine-grained spatial structure from ViT embeddings, demonstrated on diverse models and tasks, with potential extensions to other perturbation groups and architectures.

Abstract

Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch. This representation trades spatial granularity for embedding dimensionality, and results in semantically rich but spatially coarsely quantized feature maps. In order to retrieve spatial details beneficial to fine-grained inference tasks we propose a training-free method inspired by "stochastic resonance". Specifically, we perform sub-token spatial transformations to the input data, and aggregate the resulting ViT features after applying the inverse transformation. The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation, but grounds it on a finer-scale spatial domain, partly mitigating the coarse effect of spatial tokenization. SRT is applicable across any layer of any ViT architecture, consistently boosting performance on several tasks including segmentation, classification, depth estimation, and others by up to 14.9% without the need for any fine-tuning.
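
The perturb, extract, invert, and aggregate steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `toy_vit_features` is a hypothetical stand-in for a frozen ViT (it simply averages each non-overlapping patch), the perturbations are circular integer shifts via `np.roll`, upsampling is nearest-neighbour, and `pool_to_tokens` uses average pooling. All of these are assumptions made for illustration, and the image size is assumed to be a multiple of the patch size.

```python
import numpy as np

def toy_vit_features(image, patch=16):
    """Hypothetical stand-in for a frozen ViT: one feature vector per patch."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    crop = image[:gh * patch, :gw * patch]
    # Average the pixels inside each non-overlapping patch -> (gh, gw, c)
    return crop.reshape(gh, patch, gw, patch, c).mean(axis=(1, 3))

def srt_features(image, extract, patch=16, max_shift=3, reduce="mean"):
    """Aggregate features over sub-token translations (illustrative sketch)."""
    aligned = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # 1. Perturb: translate the input by less than one token.
            shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
            # 2. Extract token-level features with the unchanged model.
            tokens = extract(shifted, patch)
            # 3. Upsample to pixel resolution (nearest neighbour).
            up = np.repeat(np.repeat(tokens, patch, axis=0), patch, axis=1)
            # 4. Apply the inverse translation so all views are aligned.
            aligned.append(np.roll(up, shift=(-dy, -dx), axis=(0, 1)))
    stack = np.stack(aligned)  # (num_views, H, W, C)
    # 5. Aggregate along the perturbation dimension.
    if reduce == "median":
        return np.median(stack, axis=0)
    return stack.mean(axis=0)

def pool_to_tokens(field, patch=16):
    """Average-pool the high-resolution field back to the token grid."""
    h, w, c = field.shape
    return field.reshape(h // patch, patch, w // patch, patch, c).mean(axis=(1, 3))

# Usage: a step edge at column 20, which does not align with the 16-pixel grid.
image = np.zeros((32, 32, 1))
image[:, 20:, :] = 1.0
field = srt_features(image, toy_vit_features)   # shape (32, 32, 1)
tokens = pool_to_tokens(field)                  # shape (2, 2, 1)
```

In this toy setting a single forward pass assigns the same value to every pixel of a patch, whereas the aggregated field varies within the patch that contains the edge. With a real ViT, `extract` would be replaced by the model's own feature extractor at the layer of interest.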

Paper Structure (19 sections, 9 equations, 6 figures, 7 tables)

This paper contains 19 sections, 9 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Schematic for SRT. SRT applies controlled perturbations to input images, extracting features through Vision Transformers (ViTs). These features are then upsampled to higher resolution and aligned using the inverse of the applied perturbations. Statistical aggregation, including mean and median, along the perturbation dimension, produces fine-grained feature representations. These features find utility in visualization and can also be seamlessly integrated back into the network for enhanced performance in downstream tasks.
  • Figure 2: High-resolution ViT features computed by stochastic resonance. Stochastic resonance enhances tokenized ViT features at inference time without additional training or modification of the ViT forward pass. Here we present enhanced features from different pre-trained ViT models, visualized via Principal Component Analysis. CLIP (Radford et al., 2021) captures major image components. Interestingly, although Supervised (Dosovitskiy et al., 2020) and DINO (Caron et al., 2021) are trained with different pipelines and losses, they prioritize similar regions. This may be because they are trained on the same dataset and thus capture similar inductive biases. In contrast, SAM (Kirillov et al., 2023) and MAE (He et al., 2022) capture local features over high-level semantics.
  • Figure 3: Relative improvement on the DAVIS-2017 dataset vs. different noise levels. There is an inherent trade-off between perturbation level and performance gain. Smaller perturbation ranges yield weaker improvements over the baseline model because input diversity is lower, while larger perturbations are susceptible to greater information loss. A perturbation of 3 pixels is found to be optimal for both ViT-S/16 and ViT-B/16.
  • Figure 4: Noise distribution in the features by SRT. Treating the ensembled features as a "denoised" signal, we visualize the noise distribution, which aligns with semantic boundaries (2nd row), where image patches do not align with object shapes due to quantization. For reference, we also show the difference between the resized features (by interpolation) and the original features, which shows a less meaningful grid pattern.
  • Figure 5: Comparing SRT features and resized single-forward-pass features. SRT features respect semantic boundaries better than resized features (the edge of the bird and pedals in column one and the wheels of the bicycle in column five). Resized features contain quantization artifacts: edges are vertical or horizontal lines and corners are right angles. Our features can represent much more detailed object contours.
  • ...and 1 more figures
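
The Figure 2 caption mentions visualizing features via Principal Component Analysis. One common way to do this, sketched here as an assumption and not as the authors' exact procedure, is to project each pixel's feature vector onto the top three principal components and rescale each component to [0, 1] so that it can be displayed as an RGB image. The function name `pca_rgb` is hypothetical, and the sketch assumes at least three feature channels.

```python
import numpy as np

def pca_rgb(features):
    """Map an (H, W, C) feature field to an (H, W, 3) image via PCA."""
    h, w, c = features.shape
    flat = features.reshape(-1, c).astype(np.float64)
    # Centre the features so the SVD yields principal directions.
    flat = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T  # (H*W, 3)
    # Rescale each component independently to [0, 1] for display.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(h, w, 3)

# Usage with random stand-in features (a real run would pass ViT or SRT features).
rng = np.random.default_rng(0)
image = pca_rgb(rng.normal(size=(8, 8, 16)))
```

The signs of principal components are arbitrary, so colours are only comparable within one image unless the projection is fitted once and reused across images.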