Slot Abstractors: Toward Scalable Abstract Visual Reasoning

Shanka Subhra Mondal; Jonathan D. Cohen; Taylor W. Webb

TL;DR

Slot Abstractors fuse object-centric Slot Attention with Abstractors to achieve scalable abstract visual reasoning across scenes with many objects and multiple relations. The architecture maintains $O(K^2)$ complexity while modeling higher-order relations via multi-head relational cross-attention, enabling strong systematic generalization on ART, SVRT, CLEVR-ART, PGM, and V-PROM, including a real-world image task. Extensive ablations show the necessity of slot attention, factorized slot representations, and the relational bottleneck for performance gains. The results demonstrate practical advances toward human-like abstract reasoning in vision systems and point to future work on dynamic slot counts and more efficient attention mechanisms.

Abstract

Abstract visual reasoning is a characteristically human ability, allowing the identification of relational patterns that are abstracted away from object features, and the systematic generalization of those patterns to unseen problems. Recent work has demonstrated strong systematic generalization in visual reasoning tasks involving multi-object inputs, through the integration of slot-based methods used for extracting object-centric representations coupled with strong inductive biases for relational abstraction. However, this approach was limited to problems containing a single rule, and was not scalable to visual reasoning problems containing a large number of objects. Other recent work proposed Abstractors, an extension of Transformers that incorporates strong relational inductive biases, thereby inheriting the Transformer's scalability and multi-head architecture, but it has yet to be demonstrated how this approach might be applied to multi-object visual inputs. Here we combine the strengths of the above approaches and propose Slot Abstractors, an approach to abstract visual reasoning that can be scaled to problems involving a large number of objects and multiple relations among them. The approach displays state-of-the-art performance across four abstract visual reasoning tasks, as well as an abstract reasoning task involving real-world images.

Paper Structure

This paper contains 26 sections, 5 equations, 9 figures, 16 tables, and 1 algorithm.

Introduction
Approach
Object-Centric Representation Learning
Relational Representation Learning
Related Work
Experiments
Datasets
ART
SVRT
CLEVR-ART
PGM
V-PROM
Baselines
Experimental Details
Results
...and 11 more sections

Figures (9)

Figure 1: Slot Abstractor. The Slot Abstractor consists of two major components. First, object-centric representations are extracted using Slot Attention (Locatello et al., 2020). Relation embeddings are then computed using a series of Abstractor layers (Altabaa et al., 2023). The example problem on the left is from the PGM dataset (Barrett et al., 2018), consisting of a $3\times3$ matrix of image panels populated with objects. The task is to identify the abstract pattern among the image panels, and use this pattern to fill in the missing panel (bottom right), selecting from a set of eight choices. To generate scores for each answer choice, the corresponding image panel is inserted into the problem. Slot Attention is then used to extract feature embeddings $\boldsymbol{z}_{k=1...K}$ and position embeddings $\boldsymbol{m}_{k=1...K}$ for each panel. Relation embeddings $\boldsymbol{s}$ are then computed through a series of Abstractor layers. Each layer consists of relational cross-attention, self-attention, and feedforward layers, with residual connections after each of these. Relational cross-attention uses feature embeddings to generate keys $k$ and queries $q$, and the relation embeddings from the previous layer to generate values $v$. Relation embeddings are initialized using position embeddings. After $L$ Abstractor layers, relation embeddings are averaged and passed through a linear layer to generate a score $y$.
Figure 2: Visualization of the Abstractor's Relation Embeddings. The output of the first head of relational cross-attention, projected onto the first two principal components, for 100 examples from the test set of the same/different ART task. Two distinct clusters form, corresponding to problems in which the relation among the objects is 'same' and problems in which it is 'different'.
Figure 3: Abstract Reasoning Tasks (ART) Dataset. The 'same/different' task requires identifying whether two objects are the same or different. The 'relational-match-to-sample' task requires selecting, from two pairs of target objects, the pair that has the same relation ('same' or 'different') as a source pair of objects. We presented the problem in a $2\times2$ array format, with the source pair of objects in the top row, and a target pair in the bottom row (separate images for each target pair). In the 'distribution-of-3' task, the first row contains a set of three objects, and the second row contains an incomplete set. The task is to select the missing object from a set of four choices. We presented the problem in a $2\times3$ array format, with one of the choices inserted in the bottom right cell (separate images for each choice). In the 'identity rules' task, the first row contains three objects that follow an abstract pattern (ABA, ABB, or AAA), and the task is to select the choice that would result in the same relation being instantiated in the second row. We presented the problems in the same format as the 'distribution-of-3' task.
Figure 4: Synthetic Visual Reasoning Test (SVRT) Dataset. (a) Examples of a task depicting the same/different relation. Each row shows an example from each of the two categories. In category 1 there are three sets of two identical shapes. In category 2 there are two sets of three identical shapes. (b) Examples of a task depicting a spatial relation. Each row shows an example from each of the two categories. In category 1 three out of four shapes are touching each other. In category 2 there are two sets of two shapes touching each other.
Figure 5: CLEVR-ART Dataset. Relational-match-to-sample: An example depicting a 'different' relation between the source pair (back row) of objects and the target pair (front row) of objects. Problems were presented in the same format as the 'relational-match-to-sample' task of the ART dataset. Identity rules: An example problem depicting the ABA rule among the back row and front row of objects. Problems were presented in the same format as the 'identity rules' task of the ART dataset.
...and 4 more figures
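
The relational cross-attention step described in the Figure 1 caption can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation. It assumes standard scaled dot-product attention, and it omits the multi-head structure, the residual connections, and the self-attention and feedforward sublayers. The helper names (`matmul`, `softmax`, `rand_matrix`), the random weights, and the sizes `K` and `D` are hypothetical choices for illustration. The point it shows is that queries and keys come from the feature embeddings z, while values come from the relation embeddings s (initialized from the position embeddings m), so object features influence the output only through the K x K attention weights.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relational_cross_attention(z, s, w_q, w_k, w_v):
    """Single-head sketch: q and k from features z, v from relation embeddings s."""
    d = len(w_q[0])
    q = matmul(z, w_q)  # queries from feature embeddings
    k = matmul(z, w_k)  # keys from feature embeddings
    v = matmul(s, w_v)  # values from relation embeddings
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # K x K pairwise scores between slots
    attn = [softmax([x / math.sqrt(d) for x in row]) for row in scores]
    return matmul(attn, v), attn


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: Gaussian random matrix for the demo."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
K, D = 4, 3  # assumed number of slots and embedding size
z = rand_matrix(K, D, rng)  # stand-in for slot feature embeddings
m = rand_matrix(K, D, rng)  # stand-in for slot position embeddings
w_q, w_k, w_v = (rand_matrix(D, D, rng) for _ in range(3))

# Relation embeddings are initialized from the position embeddings.
s_next, attn = relational_cross_attention(z, m, w_q, w_k, w_v)
```

Because `attn` has one row and one column per slot, the cost of this step grows quadratically in the number of slots K, which matches the $O(K^2)$ complexity mentioned in the TL;DR. In the full model the updated relation embeddings would pass through the remaining sublayers, and after $L$ layers they would be averaged and mapped to a score for each candidate answer.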
