Locality-Attending Vision Transformer

Sina Hajimiri, Farzad Beizaee, Fereshteh Shakeri, Christian Desrosiers, Ismail Ben Ayed, Jose Dolz

TL;DR

This work presents a simple yet effective add-on that improves the segmentation performance of vision transformers after standard image-level classification training, while retaining their image-level recognition capabilities.

Abstract

Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.

Paper Structure (52 sections, 18 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Qualitative evaluation on the attention maps. The final attention maps (before the classification head) of ViT and LocAtViT for the [CLS] token and three patches are illustrated for an image with label school bus.
  • Figure 2: LocAt considerably enhances different baselines in segmentation, while preserving or even improving classification.
  • Figure 3: Illustration of the Gaussian-Augmented (GAug) attention for a $3 \times 3$ grid. (a) The Gaussian addition is obtained based on the query and is added to the attention logits. The $p^\text{th}$ row in the attention logits matrix presents the attention of patch $p$ to all patch tokens. The reshaped matrix illustrates that with GAug, both local and global attentions are integrated. (b) The supplement matrix $\mathbf{S}$ encourages attending to the locality and is computed using the pairwise squared difference tensor $\mathbf{D}$ from Eq. \ref{eq:pairwise-diff}. For simplicity, the [CLS] token is not shown in this visualization, and Gaussian variances and scaling coefficients are set to a constant value for all patches.
  • Figure 4: Qualitative evaluation on the attention maps. The final attention maps (before the classification head) of ViT and LocAtViT for the [CLS] token and three different patches are illustrated for three different images from mini-ImageNet with labels: orange, Komondor, and corn.
  • Figure 5: Degradation of local features in vanilla ViT. Features in ViT collapse to the global information in the last layers while in LocAtViT, patch features encode local information.
  • ...and 1 more figure
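
To make the idea in the Figure 3 caption concrete, the following is a minimal sketch of adding a Gaussian supplement matrix to attention logits on a $3 \times 3$ patch grid. It is not the authors' implementation (see the linked repository for that). The function names (`grid_positions`, `gaussian_supplement`, `gaug_attention`), the exact Gaussian form, and the constant `sigma` and `alpha` are assumptions made for illustration, mirroring the caption's simplification of constant variances and scaling coefficients. The [CLS] token is omitted, as in the figure.

```python
import math


def grid_positions(h, w):
    """Return (row, col) coordinates of each patch in row-major order."""
    return [(r, c) for r in range(h) for c in range(w)]


def gaussian_supplement(h, w, sigma=1.0, alpha=1.0):
    """Build an (h*w) x (h*w) matrix S that is largest where query and key coincide.

    Assumption of this sketch: sigma (width) and alpha (scale) are fixed
    scalars shared by all patches; the caption says the addition is obtained
    based on the query, which this sketch does not model.
    """
    pos = grid_positions(h, w)
    S = []
    for (rp, cp) in pos:
        row = []
        for (rq, cq) in pos:
            # Pairwise squared difference along rows and columns, summed.
            d2 = (rp - rq) ** 2 + (cp - cq) ** 2
            row.append(alpha * math.exp(-d2 / (2.0 * sigma ** 2)))
        S.append(row)
    return S


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def gaug_attention(logits, S):
    """Add S to precomputed attention logits, then apply a row-wise softmax."""
    return [
        softmax([l + s for l, s in zip(l_row, s_row)])
        for l_row, s_row in zip(logits, S)
    ]


# Usage: with uniform content logits, the supplement alone shapes attention.
n = 9
uniform_logits = [[0.0] * n for _ in range(n)]
S = gaussian_supplement(3, 3, sigma=1.0, alpha=2.0)
attn = gaug_attention(uniform_logits, S)
center = attn[4]  # attention of the central patch to all 9 patches
```

Because the supplement is added to the logits rather than used as a hard mask, distant patches keep non-zero attention weight, so in this sketch the local bias coexists with global attention.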