SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Ankit Vani; Bac Nguyen; Samuel Lavoie; Ranjay Krishna; Aaron Courville

TL;DR

SPARO is a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Used with CLIP and DINO, it improves results on downstream recognition, robustness, retrieval, and compositionality benchmarks; ablation experiments and visualizations of learned concepts provide further insight.

Abstract

Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using SPARO, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual SPARO concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of SPARO's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.

Paper Structure (40 sections, 2 equations, 5 figures, 14 tables)

Introduction
Related work
Transformer read-outs.
Slot representations.
Method
Notation
Separate-head attention read-out (Sparo)
CLIP with Sparo
DINO with Sparo
Results
Datasets.
Models.
Zero-shot recognition, robustness, and compositionality
Zero-shot image and text retrieval
Linear probe and DINO nearest neighbors classification
...and 25 more sections

Figures (5)

Figure 1: Illustration of Sparo, a read-out mechanism that structures representations as collections of separately-attended concepts. Take a standard $N$-block transformer encoder (ViT here as an example), producing an encoding ${\bm{y}}$ through extraction of its CLS token output. We can replace the $N$th transformer block with the Sparo module (typically with equal or fewer parameters) to produce a Sparo encoding ${\bm{y}}$, which is a concatenation of $L$ Sparo slots. Each Sparo slot ${\bm{y}}_l$ is produced through single-head attention over the backbone outputs using an embedded query ${\bm{q}}_l$. The value projection is a composition of a slot-specific key projection parameterized by ${\bm{K}}_l$ and a slot-wise projection shared between all Sparo concepts parameterized by ${\bm{W}}$.
Figure 2: Relative differences of CLIP+Sparo zero-shot accuracies when compared to CLIP+GAP on the VTAB benchmark.
Figure 3: Visualization of the attended image and text positions for three Sparo slots (one per row) across four examples (one per column) from MS COCO. We surmise that the Sparo concepts from top to bottom represent the subject, activity, and location.
Figure B.1: Effect of varying $L$ and $V$ values for CLIP$^{32}$+Sparo, and the embedding size for CLIP$^{32}$ and CLIP$^{32}$+GAP, when training on CC15M.
Figure B.2: Additional visualizations of attended image and text positions for three Sparo slots (one per row). We surmise that the Sparo concepts from top to bottom represent animals, transportation, and seats. Notice that top-right and center-right examples have similar attention masks over 'horse,' but consider different aspects of the concept -- one as an animal, another as a mode of transportation.
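
The read-out described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `sparo_readout`, the dimension names `n`, `d`, and `D`, the reading of $V$ as the per-slot output size, and the $1/\sqrt{D}$ scaling of attention logits (borrowed from standard scaled dot-product attention) are all assumptions made for illustration. Only the overall structure (one embedded query ${\bm{q}}_l$ and one key projection ${\bm{K}}_l$ per slot, a value projection formed by composing ${\bm{K}}_l$ with a shared ${\bm{W}}$, and concatenation of $L$ slots) follows the caption.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def sparo_readout(X, q, K, W):
    """Hypothetical sketch of a separate-head attention read-out.

    X: (n, d)    backbone outputs, one row per position
    q: (L, D)    embedded queries, one per slot
    K: (L, D, d) slot-specific key projections
    W: (V, D)    slot-wise projection shared by all slots
    Returns a (L * V,) encoding: the concatenation of L slots.
    """
    D = q.shape[1]
    slots = []
    for q_l, K_l in zip(q, K):
        keys = X @ K_l.T                          # (n, D) slot-specific keys
        # Assumption: logits scaled by sqrt(D), as in standard attention.
        attn = softmax(keys @ q_l / np.sqrt(D))   # (n,) single-head weights
        values = keys @ W.T                       # (n, V) value = W K_l x
        slots.append(attn @ values)               # (V,) one slot
    return np.concatenate(slots)

# Usage with random parameters (all sizes are arbitrary).
rng = np.random.default_rng(0)
n, d, D, L, V = 5, 8, 4, 3, 2
X = rng.normal(size=(n, d))
q = rng.normal(size=(L, D))
K = rng.normal(size=(L, D, d))
W = rng.normal(size=(V, D))
y = sparo_readout(X, q, K, W)
print(y.shape)  # (6,), i.e. L * V
```

Because each slot has its own query and key projection but shares ${\bm{W}}$, every slot attends over the backbone outputs independently and occupies its own fixed block of the final encoding.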
