Simplified priors for Object-Centric Learning

Vihang Patil; Andreas Radler; Daniel Klotz; Sepp Hochreiter

TL;DR

SAMP proposes a minimal, fully-differentiable, non-iterative approach to object-centric learning by combining specialized sub-networks with MaxPool-based competition and a Simplified Slot Attention (SSA) layer, followed by a Spatial Broadcast Decoder. The method encodes images with a CNN, derives primitive slots via grouped sub-networks, and uses those as queries in SSA, where $W = \operatorname{softmax}\left(\frac{K Q^{T}}{\tau}\right)$ and $S = W^{T} V$ with $\tau = \sqrt{n}$, producing slot representations that are decoded separately and blended through per-slot masks. Empirically, SAMP matches or exceeds prior slot-based methods on standard benchmarks (CLEVR6, Multi-dSprites, Tetrominoes), while offering superior scalability due to its non-iterative design. The work also analyzes the role of attention mechanisms and slot count, and discusses implications for continual learning where object-centric representations enable robust abstraction and knowledge transfer under resource constraints. Overall, SAMP demonstrates that a simple, competition-driven architecture using conventional CNNs can achieve strong object-centric representations without iterative refinement.
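The SSA step in the summary above can be illustrated with a minimal, dependency-free sketch. Several things here are assumptions and not taken from the paper: the function names are hypothetical, $n$ is read as the feature dimension, the softmax is taken over the slot axis (so that slots compete for each pixel), and the learned projections that would produce $Q$, $K$, and $V$ are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def simplified_slot_attention(queries, keys, values):
    """Single, non-iterative attention step: W = softmax(K Q^T / tau), S = W^T V.

    queries: num_slots x n list of lists (the primitive slots)
    keys, values: num_pixels x n lists of lists (pixel features)
    Returns (slots, weights); weights has shape num_pixels x num_slots.
    """
    n = len(queries[0])
    tau = math.sqrt(n)  # assumption: n is the feature dimension
    # logits[p][s] = <K_p, Q_s> / tau, i.e. one row of K Q^T / tau per pixel
    logits = [
        [sum(k * q for k, q in zip(k_row, q_row)) / tau for q_row in queries]
        for k_row in keys
    ]
    # Assumption: normalize over slots, so each pixel distributes its weight
    weights = [softmax(row) for row in logits]
    # slots[s] = sum_p W[p][s] * V[p], i.e. S = W^T V
    slots = [
        [
            sum(weights[p][s] * values[p][d] for p in range(len(values)))
            for d in range(len(values[0]))
        ]
        for s in range(len(queries))
    ]
    return slots, weights


# Toy input: three "pixels" with 2-d features and two primitive slots.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = keys
queries = [[4.0, 0.0], [0.0, 4.0]]
slots, weights = simplified_slot_attention(queries, keys, values)
```

In this toy input the first two pixels align with the first query and the third pixel with the second, so each slot ends up dominated by the features of the pixels it wins.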

Abstract

Humans excel at abstracting data and constructing reusable concepts, a capability lacking in current continual learning systems. The field of object-centric learning addresses this by developing abstract representations, or slots, from data without human supervision. Different methods have been proposed to tackle this task for images, but most are overly complex, non-differentiable, or poorly scalable. In this paper, we introduce a conceptually simple, fully-differentiable, non-iterative, and scalable method called SAMP (Simplified Slot Attention with Max Pool Priors). It is implementable using only Convolution and MaxPool layers and an Attention layer. Our method encodes the input image with a Convolutional Neural Network and then uses a branch of alternating Convolution and MaxPool layers to create specialized sub-networks and extract primitive slots. These primitive slots are then used as queries for a Simplified Slot Attention over the encoded image. Despite its simplicity, our method is competitive with or outperforms previous methods on standard benchmarks.
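To make the "pool, then flatten" idea behind primitive slots concrete, here is a rough sketch. It is a deliberate simplification under stated assumptions: the convolutions between pooling steps are left out, each sub-network is assumed to receive its own single-channel feature map, and the names `max_pool_2x2` and `primitive_slot` are hypothetical, not the authors' code.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on an H x W list of lists (H and W even)."""
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, len(fmap[0]), 2)
        ]
        for i in range(0, len(fmap), 2)
    ]


def primitive_slot(fmap, num_pools):
    """Pool a feature map num_pools times, then flatten it into a vector.

    Simplification: the convolutions that would alternate with the
    pooling steps are omitted here.
    """
    for _ in range(num_pools):
        fmap = max_pool_2x2(fmap)
    return [v for row in fmap for v in row]


# Assumption: one feature map per sub-network, giving one primitive slot each.
group_a = [
    [1, 2, 0, 0],
    [3, 4, 0, 5],
    [0, 0, 7, 6],
    [9, 0, 8, 0],
]
group_b = [
    [0, 0, 1, 0],
    [0, 2, 0, 0],
    [3, 0, 0, 0],
    [0, 0, 0, 4],
]
primitive_slots = [primitive_slot(g, 1) for g in (group_a, group_b)]
# primitive_slots == [[4, 5, 9, 8], [2, 1, 3, 4]]
```

Each pooling step keeps only the strongest activation in a window, which is one way to read the "competition" that max pooling contributes.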

Paper Structure (39 sections, 1 equation, 12 figures, 10 tables, 2 algorithms)

Introduction
Related Work
Graph-based approaches
Generative approaches
Iterative refinement approaches
Competition in Neural Networks
Method
Architecture
Encoder
Grouping
Decoder
Competition in SAMP
Specialization through MaxPool layers
Competition through SSA Layer
Competition through Spatial Broadcast Decoder
...and 24 more sections

Figures (12)

Figure 1: Grouping: We learn Primitive Slots from image features using Specialized Sub-Networks. We obtain pixel features by flattening all the image features from the encoder. We pass pixel features and Primitive Slots to a Simplified Slot Attention (SSA) layer, where Keys (K) and Values (V) are the pixel features and Queries are the Primitive Slots. SSA layer outputs the slots. The decoder is applied on every slot separately to reconstruct the input and a mask. A softmax is applied to the masks along the pixel dimension (for simplicity the masks are not shown in the figure). The final reconstruction is obtained by performing a weighted sum of all the individual reconstructions across the pixels with the weights coming from the masks.
Figure 2: Specialized Sub-Networks: We use alternating Convolution and MaxPool layers. After these layers, we flatten features to obtain Primitive Slots. The architecture, along with the slot-wise reconstruction in the decoder, induces specialization in the sub-networks. The sub-networks are forced to explain different parts of the input. Therefore, the resultant Primitive Slots are good queries for the SSA layer.
Figure 3: Results on CLEVR6: The first column is the original image. The second column is the final reconstruction by the model, namely the weighted sum of individual reconstructions. Columns 3-11 are reconstructions of individual slots. The individual reconstructions are displayed without the mask.
Figure 4: Left: Reconstructions of Multi-dSprites. The first column is the original image. The second column is the weighted sum of individual reconstructions where the weights come from the masks to which a pixel-wise softmax was applied. Columns 3-11 are reconstructions of individual slots. Right: Reconstructions of Tetrominoes. Again, the first column is the original image, while the second column is the weighted sum of individual reconstructions. Columns 3-6 are reconstructions of individual slots. The individual reconstructions are displayed without the mask.
Figure 5: Reconstructions and visualized attention heatmaps of slots over pixel features during training on Tetrominoes: The numbers on the left denote the completed training epochs; the left group of images are reconstructions, whereas the right group are visualized attention maps. The columns of the reconstruction images are in the following order: (col. 1) ground truth image, (col. 2) final reconstruction, (cols. 3-6) individual slot reconstructions. The columns of the visualized attention maps are in the following order: (col. 1) ground truth image, (cols. 2-5) individual attention maps of the queries over the keys (i.e., the projected pixel features).
...and 7 more figures
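The mask-weighted blending described in the Figure 1 and Figure 4 captions can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `blend` is hypothetical, the softmax is assumed to normalize across slots at each pixel (so the weights at a pixel sum to one), and each pixel is a single scalar for brevity.

```python
import math


def blend(recons, mask_logits):
    """Combine per-slot reconstructions using softmax-normalized masks.

    recons, mask_logits: num_slots x num_pixels lists of lists.
    Returns (image, masks); masks[s][p] is the weight of slot s at pixel p.
    """
    num_slots, num_pixels = len(recons), len(recons[0])
    masks = [[0.0] * num_pixels for _ in range(num_slots)]
    image = [0.0] * num_pixels
    for p in range(num_pixels):
        column = [mask_logits[s][p] for s in range(num_slots)]
        m = max(column)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in column]
        total = sum(exps)
        for s in range(num_slots):
            masks[s][p] = exps[s] / total
            image[p] += masks[s][p] * recons[s][p]
    return image, masks


# Two slots, two pixels: slot 0 claims pixel 0, slot 1 claims pixel 1.
recons = [[1.0, 1.0], [0.0, 0.0]]
mask_logits = [[10.0, -10.0], [-10.0, 10.0]]
image, masks = blend(recons, mask_logits)
```

Because the masks are normalized per pixel, a slot with a confident mask at a pixel effectively decides that pixel's value in the final image, which is why the individual slot reconstructions in the figures are shown without masks.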