Denoising Vision Transformers

Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, Yue Wang

TL;DR

This paper identifies persistent grid-like artifacts in Vision Transformer feature maps that correlate with positional embeddings and hinder dense prediction tasks. It introduces Denoising Vision Transformers (DVT), a two-stage approach that first decomposes ViT outputs per image using neural fields to separate semantics, position-related artifacts, and residuals, and then trains a lightweight denoiser to predict clean features for online use. Across six ViTs and multiple dense vision tasks, DVT yields consistent performance gains, sometimes surpassing larger models while adding only a small parameter footprint for the denoiser. The work highlights the importance of reconsidering positional embeddings in ViT design and demonstrates a practical, plug-in denoiser that improves robustness and interpretability of ViT features for dense vision applications.

Abstract

We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
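
The first stage rests on the observation that semantic content moves with the crop, while positional artifacts stay fixed to the patch grid. The following is a minimal sketch of that idea on a toy problem, not the authors' method: it assumes scalar features on a 1D signal, a purely additive model with no residual term, and it swaps the neural fields for an ordinary least-squares solve. The function `decompose` and all variable names are hypothetical, introduced only for illustration.

```python
import numpy as np

def decompose(views, offsets, signal_len):
    """Least-squares split of crop features into a content term and a position term.

    views[k, p] is the scalar raw feature at grid position p of crop k,
    where crop k starts at offsets[k] in the full signal.
    Toy model assumed here (an illustration, not the paper's formulation):
        views[k, p] = semantics[offsets[k] + p] + artifact[p]
    """
    num_views, crop_len = views.shape
    rows = num_views * crop_len
    design = np.zeros((rows + 1, signal_len + crop_len))
    target = np.zeros(rows + 1)
    r = 0
    for k, off in enumerate(offsets):
        for p in range(crop_len):
            design[r, off + p] = 1.0          # content follows the crop
            design[r, signal_len + p] = 1.0   # artifact is tied to grid position
            target[r] = views[k, p]
            r += 1
    # Extra row: force the artifact to sum to zero. Without it, a constant
    # could move freely between the two terms and the split would be ambiguous.
    design[r, signal_len:] = 1.0
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[:signal_len], solution[signal_len:]

# Synthetic data: a smooth "semantic" signal plus a fixed per-position artifact.
rng = np.random.default_rng(0)
signal_len, crop_len = 32, 8
true_semantics = np.sin(np.linspace(0.0, 4.0 * np.pi, signal_len))
true_artifact = rng.normal(size=crop_len)
true_artifact -= true_artifact.mean()  # match the zero-mean convention above

# Every crop sees different content but the same artifact pattern.
offsets = np.arange(signal_len - crop_len + 1)
views = np.stack([true_semantics[o:o + crop_len] + true_artifact for o in offsets])

est_semantics, est_artifact = decompose(views, offsets, signal_len)
print(np.abs(est_artifact - true_artifact).max())
print(np.abs(est_semantics - true_semantics).max())
```

In this noise-free toy setting the overlapping crops determine both terms up to a shared constant, which the zero-sum constraint removes. Real ViT features are high-dimensional and the additive model holds only approximately, which is why the paper's decomposition also carries a residual term.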

Paper Structure

This paper contains 54 sections, 6 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Denoising Vision Transformers (DVT) effectively suppresses noisy artifacts in the visual features of all Vision Transformers (ViTs) we have tested and improves performance on a broad spectrum of dense prediction tasks, including semantic segmentation, depth estimation, object detection, and object discovery. Our evaluation encompasses a representative set of ViTs, including DINOv2 [oquab2023dinov2], DeiT-III [touvron2022deit], EVA-02 [fang2023eva], CLIP [radford2021learning], and DINOv2-reg [darcet2023vision]. We visualize the features before and after DVT, colored via principal component analysis (PCA). Best viewed in color. Right: We report downstream dense prediction performance, averaged over all models.
  • Figure 2: Artifacts hurt semantic coherence. For each triplet, we show a feature map, a K-Means cluster map, and a similarity map of the central patch (red dotted) with other patches in the image. Observe how artifacts negatively impact clustering accuracy and similarity correspondences, and how our denoising mitigates these issues.
  • Figure 3: Impact of positional embeddings in ViTs. (a) Comparison between DINOv2 ViTs [oquab2023dinov2] trained with and without positional embeddings ("ViT" vs. "ViT$^*$"). We show feature maps from (1) a standard ViT, (2) a ViT using only positional embeddings (PE) as input, emphasizing the emergence of artifacts, and (3) a PE-free ViT$^*$, displaying a clear absence of these artifacts. In the figure, "Patch" denotes the patch embedding and "PE" the positional embedding. (b) Illustration of how a ViT retains and propagates the positional embeddings. (c) Despite significant differences in content across frames, the artifacts largely maintain a consistent relative position in the images (central row). Our DVT effectively removes these artifacts, as shown in the final row.
  • Figure 4: Method Overview. DVT consists of a two-stage denoising pipeline. (a) In the first stage, our method decomposes the raw feature of an image crop into a noise-free semantics term $\mathcal{F}$, an input-independent, position-related artifact term $\mathcal{G}$, and an additional residual term $\Delta$. (b) In the second stage, we train a generalizable denoiser to predict clean features from the raw ViT features. At inference time, only a single forward pass is needed to obtain denoised features.
  • Figure 5: Visual analysis of ViT output features and denoised features. (a) Visualizations of the feature maps from all layers of a DINOv2 [oquab2023dinov2] ViT-base model. Notably, the artifacts in the feature maps derived from the cat image exhibit a strong visual correlation with those from the zero-tensor inputs. (b) Visualizations of the decomposed artifacts, the original features, and the denoised features across various layers of DINOv2 ViTs. We observe similar patterns in differently-sized models.
  • ...and 12 more figures
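
The captions above refer to two standard ways of inspecting patch features: coloring a feature map by its leading principal components, and mapping the similarity of one patch to all others. The sketch below shows one common way to compute both with NumPy. It is an illustration under assumptions, not the authors' plotting code: the helper names `pca_rgb` and `center_similarity` are hypothetical, cosine similarity is assumed as the similarity measure, and random features stand in for real ViT outputs.

```python
import numpy as np

def pca_rgb(features):
    """Map an (H, W, C) feature map to an (H, W, 3) image in [0, 1] via PCA."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    centered = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    # Rescale each of the three components independently to [0, 1].
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    scaled = (projected - lo) / np.maximum(hi - lo, 1e-12)
    return scaled.reshape(h, w, 3)

def center_similarity(features):
    """Cosine similarity between the central patch and every patch, shape (H, W)."""
    h, w, _ = features.shape
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit[h // 2, w // 2]

# Placeholder input: a 16x16 grid of 64-dimensional patch features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, 64))

rgb = pca_rgb(feats)            # pass to an image viewer to get a PCA coloring
sim = center_similarity(feats)  # pass to a heatmap to get a similarity map
print(rgb.shape, sim.shape)
```

Applying the same two functions to raw and denoised features of one image gives a side-by-side comparison in the spirit of Figures 1 and 2. Note that PCA bases fitted separately on two feature maps are not aligned, so colors are only comparable when the projection is shared.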