GTA: Guided Transfer of Spatial Attention from Object-Centric Representations

SeokHyun Seo, Jinwoo Hong, JungWoo Chae, Kyungyul Kim, Sangheum Hwang

TL;DR

This paper tackles the problem that Vision Transformers (ViT), because of their low inductive bias, lose valuable object-localization representations when fine-tuned on small datasets. It introduces Guided Transfer of spatial Attention (GTA), a simple $L_2$-based regularization that aligns the attention logits of a downstream target ViT with those of a pre-trained source model, focusing on the [CLS] token's spatial mixing coefficients. GTA substantially improves transfer learning performance across five fine-grained datasets, with especially large gains in data-scarce regimes; it also enhances segmentation quality and combines well with TransMix. The work demonstrates that regularizing attention logits is an effective, generalizable strategy for preserving transferable localization properties in ViT during transfer learning. The approach is simple to implement, compatible with both self-supervised (SSL) and supervised (SL) pretraining, and offers practical benefits for rapidly adapting ViT to new tasks with limited labeled data.
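
As a rough illustration of the objective described above, the regularized fine-tuning loss can be written as a task loss plus a weighted $L_2$ penalty on the [CLS] attention logits. The notation here is an assumption made for this sketch and is not taken from the paper: $a_s^{(l,h)}$ and $a_t^{(l,h)}$ stand for the pre-softmax [CLS]-query attention logits of the frozen source model and the trainable target model at layer $l$ and head $h$, $\lambda$ is the regularization weight, and the choice of which layers and heads to sum over (and any normalization) is left open.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \sum_{l} \sum_{h} \left\lVert a_s^{(l,h)} - a_t^{(l,h)} \right\rVert_2^2
$$

Setting $\lambda = 0$ recovers plain fine-tuning, while a larger $\lambda$ keeps the target model's [CLS] attention closer to that of the source model.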

Abstract

Utilizing well-trained representations in transfer learning often results in superior performance and faster convergence compared to training from scratch. However, even if such good representations are transferred, a model can easily overfit the limited training dataset and lose the valuable properties of the transferred representations. This phenomenon is more severe in ViT due to its low inductive bias. Through experimental analysis using attention maps in ViT, we observe that the rich representations deteriorate when trained on a small dataset. Motivated by this finding, we propose a novel and simple regularization method for ViT called Guided Transfer of spatial Attention (GTA). Our proposed method regularizes the self-attention maps between the source and target models. A target model can fully exploit the knowledge related to object localization properties through this explicit regularization. Our experimental results show that the proposed GTA consistently improves accuracy across five benchmark datasets, especially when the amount of training data is small.

Paper Structure (21 sections, 5 equations, 5 figures, 7 tables)


Figures (5)

  • Figure 1: Comparison of self-attention maps from pre-trained, naïvely fine-tuned, and GTA-trained models. The self-attention maps of the multiple heads are aggregated by taking the maximum value and visualized in red. Each column shows the attention maps from models that are pre-trained, fine-tuned, and fine-tuned with GTA on 15% and 100% of the training data, respectively. The maps indicate that GTA is able to fully leverage the well-trained representations learned by the upstream task.
  • Figure 1: Comparison of self-attention maps from pre-trained, naïvely fine-tuned, and GTA-trained models across multiple datasets. We consider the CUB, Cars, Aircraft, Dogs, and Pets datasets. The self-attention maps of the multiple heads are aggregated by taking the maximum value and visualized in red. Each column shows the attention maps from models that are pre-trained using SSL, fine-tuned, and fine-tuned with GTA on 15% and 100% of the training data, respectively. The maps indicate that GTA is able to fully leverage the object-centric representations learned by the SSL model.
  • Figure 2: The overall pipeline of the proposed GTA. An image is first fed into both the frozen source model and the trainable target model. By minimizing the $L_2$ distance between the attention logits from each model, the target model is optimized for the current task while focusing on the image tokens that require attention by exploiting the source model.
  • Figure 3: Comparison of segmentation results on PASCAL-VOC12. Pre-trained refers to the segmentation results obtained from the attention logits of the upstream model. Baseline refers to the results obtained by fine-tuning the pre-trained model on the target task. GTA refers to the results obtained by applying GTA during fine-tuning. GTA yields improved segmentation compared to the other two.
  • Figure 4: The effect of different values of $\lambda$ on GTA. The optimal $\lambda$ value varies depending on the characteristics and amount of the target data.
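
The captions above describe three mechanical steps: computing [CLS] attention logits, penalizing the $L_2$ distance between source and target logits with weight $\lambda$, and aggregating per-head attention maps by their maximum for visualization. The following is a minimal pure-Python sketch of those steps, not the authors' implementation. All function names are hypothetical, and the details (summing the penalty over heads without normalization, scaling logits by the square root of the head dimension, a single layer) are assumptions made for illustration.

```python
import math


def cls_attention_logits(q_cls, keys):
    """Pre-softmax scaled dot-product logits of the [CLS] query against each key.

    q_cls: list of floats (one head's [CLS] query vector).
    keys:  list of key vectors, one per token, each the same length as q_cls.
    """
    scale = 1.0 / math.sqrt(len(q_cls))
    return [scale * sum(q * k for q, k in zip(q_cls, key)) for key in keys]


def gta_penalty(source_logits, target_logits):
    """Squared L2 distance between source and target logits, summed over heads.

    Both arguments are nested lists shaped [heads][tokens].
    Assumption: no averaging or per-layer weighting is applied.
    """
    return sum(
        (s - t) ** 2
        for s_head, t_head in zip(source_logits, target_logits)
        for s, t in zip(s_head, t_head)
    )


def total_loss(task_loss, source_logits, target_logits, lam):
    """Task loss plus the lambda-weighted attention-logit penalty."""
    return task_loss + lam * gta_penalty(source_logits, target_logits)


def aggregate_heads_max(head_maps):
    """Combine per-head attention maps by taking the maximum at each token."""
    return [max(values) for values in zip(*head_maps)]


# Toy example: two heads, two tokens per head.
source = [[1.0, 2.0], [0.0, 0.0]]   # logits from the frozen source model
target = [[1.0, 0.0], [1.0, 0.0]]   # logits from the trainable target model
loss = total_loss(task_loss=0.5, source_logits=source, target_logits=target, lam=0.1)
```

In a real training loop the source logits would be computed without gradients, since the source model is frozen, and only the target model would be updated. The sketch leaves that out because it depends on the deep learning framework in use.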