Learning Object Focused Attention

Vivek Trivedy, Amani Almalki, Longin Jan Latecki

TL;DR

This paper addresses the tendency of Vision Transformers to rely on background cues by introducing Object Focused Attention (OFA), an auxiliary training loss that aligns the transformer's attention with object-centric patch relationships. OFA constructs a Patch Adjacency Matrix B from segmentation masks, forms masked, background-free versions S'' and B'' of the pre-attention matrix S and of B, and minimizes the loss L_OFA = ||S'' - B''||_2 at selected layers, guiding learning without increasing inference cost. The method is integrated with MUSIQ and can be paired with self-supervised MAE, including a novel multiscale masking strategy, yielding improved multilabel classification performance on MS COCO and Pascal VOC2012 and greater robustness to out-of-distribution backgrounds and texture biases. Empirical results show significant gains over baselines, state-of-the-art performance when combined with MAE and multiscale representations, and evidence that OFA drives learning of holistic object shapes rather than spurious textures. Semi-supervised learning and panoptic segmentation are noted as promising extensions.

Abstract

We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss which we call the object-focused attention (OFA) loss. We restrict the attention to image patches that belong to the same object class, which allows ViTs to gain a better understanding of configural (or holistic) object shapes by focusing on intra-object patches instead of other patches such as those in the background. Our proposed inductive bias fits easily into the attention framework of transformers since it only adds an auxiliary loss over selected attention layers. Furthermore, our approach has no additional overhead during inference. We also experiment with multiscale masking to further improve the performance of our OFA model and give a path forward for self-supervised learning with our method. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shapes rather than spurious correlations via general textures. For our OOD setting, we generate a novel dataset using the COCO dataset and Stable Diffusion inpainting which we plan to share with the community.
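
The loss described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: each patch takes the majority class of its pixels, the pre-attention matrix S is the scaled dot product QK^T / sqrt(d) before softmax, "background-free" means dropping the rows and columns of background patches, and the norm is the Frobenius norm. The names `patch_labels`, `patch_adjacency`, `ofa_loss`, and `BACKGROUND` are hypothetical.

```python
import numpy as np

BACKGROUND = 0  # assumed class id for background pixels


def patch_labels(seg_mask, patch):
    """Give each patch the most frequent class id among its pixels (assumption)."""
    h, w = seg_mask.shape
    labels = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            block = seg_mask[r:r + patch, c:c + patch].ravel()
            labels.append(np.bincount(block).argmax())
    return np.array(labels)


def patch_adjacency(labels):
    """B[i, j] = 1 when patches i and j carry the same class label."""
    return (labels[:, None] == labels[None, :]).astype(np.float64)


def ofa_loss(q, k, labels):
    """Frobenius norm of S'' - B'' over non-background patches only."""
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)             # pre-softmax scores S (assumed form)
    b = patch_adjacency(labels)          # patch adjacency matrix B
    keep = labels != BACKGROUND          # mask out background patches
    s_fg = s[np.ix_(keep, keep)]         # S'': background rows/columns dropped
    b_fg = b[np.ix_(keep, keep)]         # B'': same masking applied to B
    return np.linalg.norm(s_fg - b_fg)


# Toy 4x4 segmentation mask split into four 2x2 patches.
seg = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 1, 1],
                [2, 2, 1, 1]])
labels = patch_labels(seg, patch=2)
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
print(labels, ofa_loss(q, k, labels))
```

Because the loss reads only Q, K, and the segmentation labels, it can be computed in a side branch during training and skipped at inference, which is consistent with the claim that OFA adds no inference overhead.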

Paper Structure

This paper contains 13 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: We restrict learning attention to objects of the same class.
  • Figure 2: The proposed object focused attention (OFA) as an extension of self-attention. The arrows are labeled with the input/output matrices. The right part of the diagram is based on the original self-attention paper (Vaswani et al., 2017). The left branch computes the OFA loss. The patch adjacency matrix (PAM) module is used to compute the patch adjacency matrix $\mathbf{B}$, which is then compared to the pre-attention matrix $\mathbf{S}$.
  • Figure 3: Data flow showing the differences between training and inference. OFA is applied only at training time, so inference requires no segmentation labels.
  • Figure 4: The multiscale masking is computed by masking the grid cells (left) and carrying over the masked cells to image patches that correspond to those cells.
  • Figure 5: Comparison of attention maps of proposed MUSIQ + OFA and baseline MUSIQ.
  • ...and 2 more figures
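
The multiscale masking in the Figure 4 caption (mask grid cells, then carry the masked cells over to the patches they contain) can be sketched as follows. This is an illustrative reading of the caption, not the paper's code: it assumes a square grid, a uniformly random choice of masked cells, and patch grids whose side length is an integer multiple of the cell grid. The names `sample_cell_mask` and `carry_to_patches` are hypothetical.

```python
import numpy as np


def sample_cell_mask(grid, ratio, rng):
    """Randomly mask a fraction `ratio` of the cells in a grid x grid layout."""
    n = grid * grid
    n_masked = int(round(ratio * n))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=n_masked, replace=False)] = True
    return flat.reshape(grid, grid)


def carry_to_patches(cell_mask, patches_per_side):
    """Mark a patch as masked when the grid cell containing it is masked."""
    grid = cell_mask.shape[0]
    if patches_per_side % grid != 0:
        raise ValueError("patch grid must be a multiple of the cell grid")
    factor = patches_per_side // grid
    # Repeat each cell along both axes so it covers its factor x factor patches.
    return np.repeat(np.repeat(cell_mask, factor, axis=0), factor, axis=1)


rng = np.random.default_rng(0)
cells = sample_cell_mask(grid=4, ratio=0.75, rng=rng)
# The same cell mask is reused at every scale, so masked regions line up.
masks = {side: carry_to_patches(cells, side) for side in (4, 8, 16)}
for side, m in masks.items():
    print(side, m.shape, m.mean())
```

Sharing one cell mask across scales keeps the masked image regions aligned, so a region hidden at one scale is not visible at another.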