Learning Object Focused Attention

Vivek Trivedy; Amani Almalki; Longin Jan Latecki

TL;DR

This paper addresses the tendency of Vision Transformers (ViTs) to rely on background cues by introducing Object Focused Attention (OFA), an auxiliary training loss that aligns the transformer's attention with object-centric patch relationships. OFA constructs a Patch Adjacency Matrix B from segmentation masks and trains a masked, background-free version S'' of the pre-attention matrix S to match the corresponding target B'' derived from B, through the loss L_OFA = ||S'' - B''||_2. The loss is applied at selected layers during training only, so it guides learning without increasing inference cost. The method is integrated with MUSIQ and can be paired with self-supervised MAE, including a novel multiscale masking strategy. It yields improved multilabel classification performance on MS COCO and Pascal VOC2012 and greater robustness to out-of-distribution backgrounds and texture biases. Empirical results show significant gains over baselines and state-of-the-art performance when OFA is combined with MAE and multiscale representations, along with evidence that OFA drives learning of holistic object shapes rather than spurious textures. Semi-supervised learning and panoptic segmentation are noted as promising extensions.
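As a rough illustration of the loss described above, the following NumPy sketch builds a patch adjacency matrix from a segmentation mask and compares it with a given pre-attention matrix. Several details are assumptions made for this example and are not taken from the paper: each patch takes the majority class of its pixels, class id 0 is background, "masking" means dropping the rows and columns of background patches, and the norm is taken over all remaining entries. The function names are hypothetical, and how S is scaled before the comparison is left open.

```python
import numpy as np

def patch_labels(seg_mask, patch):
    """Assign each patch the most frequent class id inside it (assumed rule)."""
    h, w = seg_mask.shape
    gh, gw = h // patch, w // patch
    blocks = seg_mask[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(gh * gw, patch * patch)
    return np.array([np.bincount(b).argmax() for b in blocks])

def patch_adjacency(labels, background=0):
    """B[i, j] = 1 if patches i and j share the same non-background class."""
    same_class = labels[:, None] == labels[None, :]
    is_object = labels != background
    return (same_class & is_object[:, None] & is_object[None, :]).astype(np.float32)

def ofa_loss(S, B, labels, background=0):
    """||S'' - B''||_2, where '' drops background rows and columns (assumed)."""
    keep = np.flatnonzero(labels != background)
    S_masked = S[np.ix_(keep, keep)]
    B_masked = B[np.ix_(keep, keep)]
    return float(np.linalg.norm(S_masked - B_masked))

# Toy 4x4 mask with 2x2 patches: class 1 on the left, class 2 bottom right.
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [1, 1, 2, 2],
                 [1, 1, 2, 2]])
labels = patch_labels(mask, patch=2)
B = patch_adjacency(labels)
S = np.zeros_like(B)  # stand-in for a layer's pre-attention matrix
loss = ofa_loss(S, B, labels)
```

In a training setup this term would be added to the task loss at the selected layers only, which is consistent with the claim that inference cost is unchanged.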

Abstract

We propose an adaptation to the training of Vision Transformers (ViTs) that allows for explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss, which we call the object-focused attention (OFA) loss. We restrict the attention to image patches that belong to the same object class, which allows ViTs to gain a better understanding of configural (or holistic) object shapes by focusing on intra-object patches instead of other patches such as those in the background. Our proposed inductive bias fits easily into the attention framework of transformers since it only adds an auxiliary loss over selected attention layers. Furthermore, our approach has no additional overhead during inference. We also experiment with multiscale masking to further improve the performance of our OFA model and give a path forward for self-supervised learning with our method. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shapes rather than on spurious correlations with general textures. For our OOD setting, we generate a novel dataset using the COCO dataset and Stable Diffusion inpainting, which we plan to share with the community.

