
Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects

Michael A. Lepori, Alexa R. Tartaglini, Wai Keen Vong, Thomas Serre, Brenden M. Lake, Ellie Pavlick

TL;DR

This work presents a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. It finds evidence that ViTs can learn to represent somewhat abstract visual relations, a capability that has long been considered out of reach for artificial neural networks.

Abstract

Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This raises the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive biases to do so: 1) a perceptual stage wherein local object features are extracted and stored in a disentangled representation, and 2) a relational stage wherein object representations are compared. In the second stage, we find evidence that ViTs can learn to represent somewhat abstract visual relations, a capability that has long been considered out of reach for artificial neural networks. Finally, we demonstrate that failures at either stage can prevent a model from learning a generalizable solution to our fairly simple tasks. By understanding ViTs in terms of discrete processing stages, one can more precisely diagnose and rectify shortcomings of existing and future models.

Paper Structure (40 sections, 39 figures, 8 tables)


Figures (39)

  • Figure 1: Two same-different tasks. (a) Discrimination: "same" images contain two objects with the same color and shape. Objects in "different" images differ in at least one of those properties---in this case, both color and shape. (b) RMTS: "same" images contain a pair of objects that exhibit the same relation as a display pair of objects in the top left corner. In the image on the left, both pairs demonstrate a "different" relation, so the classification is "same" (relation). "Different" images contain pairs exhibiting different relations.
  • Figure 2: Attention Pattern Analysis. (a) CLIP Discrimination: The heatmap (top) shows the distribution of "local" (blue) vs. "global" (red) attention heads throughout a CLIP ViT-B/16 model fine-tuned on discrimination (Figure 1a). The $x$-axis is the model layer, while the $y$-axis is the head index. Local heads tend to cluster in early layers and transition to global heads around layer 6. For each layer, the line graph (bottom) plots the maximum proportion of attention across all $12$ heads from object patches to image patches that are 1) within the same object (within-object$=$WO), 2) within the other object (within-pair$=$WP), or 3) in the background (BG). The stars mark the peak of each. WO attention peaks in early layers, followed by WP, and finally BG. (b) From Scratch Discrimination: We repeat the analysis in (a). The model contains nearly zero local heads. (c) CLIP RMTS: We repeat the analysis for a CLIP model fine-tuned on RMTS (Figure 1b). Top: Our results largely hold from (a). Bottom: We track a fourth attention pattern---attention between pairs of objects (between pair$=$BP). We find that WO peaks first, then WP, then BP, and finally BG. This accords with the hierarchical computations implied by the RMTS task. (d) DINO RMTS: We repeat the analysis in (c) for a DINO model and find no such hierarchical pattern.
  • Figure 3: (a) Interchange interventions: The base image exhibits the "different" relation, as the two objects differ in either shape (top) or color (bottom). An interchange intervention extracts {shape, color} information from the intermediate representations generated by the same model run on a different image (source), then patches this information from the source image into the model's intermediate representations of the base image. If successful, the intervened model will now return "same" when run on the base image. DAS is optimized to succeed at interchange interventions. (b) Disentanglement Results: We report the success of interchange interventions on shape and color across layers for CLIP ViT-B/16 fine-tuned on either the discrimination or RMTS task. We find that these properties are disentangled early in the model---one property can be manipulated without interfering with the other. The background is colored according to the heatmap in Figure 2a, where blue denotes local heads and red denotes global heads.
  • Figure 4: (a) Novel Representations Analysis: Using trained DAS interventions, we can inject any vector into a model's shape or color subspaces, allowing us to test whether the same-different operation can be computed over arbitrary vectors. We intervene on a "different" image---differing only in its color property---by patching a novel color (an interpolation of red and black) into both objects in order to flip the decision to "same". (b) Discrimination Results: We perform novel representations analysis using four methods for generating novel representations: 1) adding observed representations, 2) interpolating observed representations, 3) per-dimension sampling using a distribution derived from observed representations, and 4) sampling randomly from a normal distribution $\mathcal{N}(0,1)$. The model's same-different operation generalizes well to vectors generated by adding (and generalizes somewhat to interpolated vectors) in early layers but not to sampled or random vectors. The background is colored according to the heatmap in Figure 2a (blue$=$local heads; red$=$global heads).
  • Figure 5: Linear probing and intervention results. We probe for the intermediate same-different judgments required to perform the RMTS task (blue). Probe performance reaches ceiling at around layer 5 and remains there throughout the rest of the model. We use the directions defined by the linear probe to intervene on model representations and flip an intermediate judgment (green). This intervention succeeds reliably at layer 5 but not deeper. As a control, we add a vector that is consistent with a pair's exhibited same-different relation (yellow). This has little effect. The background is colored according to the heatmap in Figure 2c (blue$=$local heads; red$=$global heads).
  • ...and 34 more figures
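
The per-layer attention statistic described in the Figure 2 caption (the proportion of attention from object patches to within-object, within-pair, and background patches, maximized over heads) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the patch-label encoding (`"obj1"`, `"obj2"`, `"bg"`), and the choice to average over an object's query patches before taking the maximum over heads are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

CATEGORIES = ("WO", "WP", "BG")


def attention_proportions(
    attn: Sequence[Sequence[float]],
    patch_labels: Sequence[str],
    source_object: str,
    other_object: str,
) -> Dict[str, float]:
    """Average share of attention that patches of `source_object` send to
    the same object (WO), the other object (WP), and the background (BG).

    attn[i][j] is the attention weight from query patch i to key patch j
    for a single head (each row is assumed to sum to 1).
    patch_labels[j] is a hypothetical label: an object name or "bg".
    """
    totals = {name: 0.0 for name in CATEGORIES}
    n_queries = 0
    for i, row in enumerate(attn):
        if patch_labels[i] != source_object:
            continue  # only object patches act as queries
        n_queries += 1
        for j, weight in enumerate(row):
            if patch_labels[j] == source_object:
                totals["WO"] += weight
            elif patch_labels[j] == other_object:
                totals["WP"] += weight
            elif patch_labels[j] == "bg":
                totals["BG"] += weight
    if n_queries == 0:
        raise ValueError("no patches belong to the source object")
    return {name: total / n_queries for name, total in totals.items()}


def max_over_heads(per_head: List[Dict[str, float]]) -> Dict[str, float]:
    """For one layer, keep the largest proportion across heads per category."""
    return {name: max(head[name] for head in per_head) for name in CATEGORIES}


# Toy layer with two heads over four patches (illustrative numbers only).
labels = ["obj1", "obj1", "obj2", "bg"]
local_head = [
    [0.5, 0.3, 0.1, 0.1],
    [0.4, 0.4, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
global_head = [
    [0.1, 0.1, 0.6, 0.2],
    [0.1, 0.1, 0.6, 0.2],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
layer_summary = max_over_heads(
    [
        attention_proportions(local_head, labels, "obj1", "obj2"),
        attention_proportions(global_head, labels, "obj1", "obj2"),
    ]
)
print(layer_summary)
```

Repeating this per layer and locating the layer where each category peaks would give curves of the kind the caption describes. The between-pair (BP) category used for RMTS would need an additional label for the second pair of objects.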
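
The four ways of generating novel representations listed in the Figure 4 caption (adding, interpolating, per-dimension sampling, and sampling from $\mathcal{N}(0,1)$) can be sketched in plain Python. The sketch covers only vector generation, not the DAS step that injects a vector into a shape or color subspace. The function names, the interpolation weight of 0.5, and the use of an independent Gaussian per dimension for method 3 are assumptions; the caption does not specify them.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence

Vector = List[float]


def add_vectors(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Method 1: element-wise sum of two observed representations."""
    return [x + y for x, y in zip(a, b)]


def interpolate(a: Sequence[float], b: Sequence[float], alpha: float = 0.5) -> Vector:
    """Method 2: linear interpolation between two observed representations.
    alpha = 0.5 (the midpoint) is an assumed default."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def sample_per_dimension(observed: Sequence[Sequence[float]], rng: random.Random) -> Vector:
    """Method 3: sample each dimension independently from a distribution
    fit to the observed values. A Gaussian with the empirical mean and
    standard deviation is assumed here for illustration."""
    columns = list(zip(*observed))
    return [rng.gauss(mean(col), pstdev(col)) for col in columns]


def sample_standard_normal(dim: int, rng: random.Random) -> Vector:
    """Method 4: sample every dimension from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Toy "observed" color representations (illustrative values only).
rng = random.Random(0)
red = [1.0, 0.0, 2.0]
black = [0.0, 0.0, 0.0]
observed = [red, black]

novel_vectors = {
    "added": add_vectors(red, black),
    "interpolated": interpolate(red, black),
    "sampled": sample_per_dimension(observed, rng),
    "random": sample_standard_normal(len(red), rng),
}
for name, vector in novel_vectors.items():
    print(name, vector)
```

In the analysis the caption describes, the same novel vector is patched into both objects, so a model whose same-different operation is abstract should then answer "same" regardless of how the vector was generated.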