Simplified priors for Object-Centric Learning
Vihang Patil, Andreas Radler, Daniel Klotz, Sepp Hochreiter
TL;DR
SAMP proposes a minimal, fully differentiable, non-iterative approach to object-centric learning by combining specialized sub-networks with MaxPool-based competition and a Simplified Slot Attention (SSA) layer, followed by a Spatial Broadcast Decoder. The method encodes images with a CNN, derives primitive slots via grouped sub-networks, and uses those as queries in SSA, where $W = \mathrm{softmax}\left(\frac{K Q^{T}}{\tau}\right)$ and $S = W^{T} V$ with $\tau = \sqrt{n}$, producing slot representations that are decoded separately and blended through per-slot masks. Empirically, SAMP matches or exceeds prior slot-based methods on standard benchmarks (CLEVR6, Multi-dSprites, Tetrominoes), while offering superior scalability due to its non-iterative design. The work also analyzes the role of attention mechanisms and slot count, and discusses implications for continual learning, where object-centric representations enable robust abstraction and knowledge transfer under resource constraints. Overall, SAMP demonstrates that a simple, competition-driven architecture using conventional CNNs can achieve strong object-centric representations without iterative refinement.
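The SSA step and the mask-based blending described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the softmax is assumed to run over the slot axis (so slots compete for each spatial position), $n$ is taken to be the slot feature dimension, the per-slot masks are assumed to be normalized with a softmax across slots, and all function names, shapes, and random inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def simplified_slot_attention(keys, values, queries):
    """Single-pass attention: W = softmax(K Q^T / tau), S = W^T V.

    keys, values: (num_positions, n) encoded image features.
    queries:      (num_slots, n) primitive slots.
    """
    n = queries.shape[-1]
    tau = np.sqrt(n)  # temperature tau = sqrt(n)
    # Assumption: softmax over the slot axis (axis=1).
    w = softmax(keys @ queries.T / tau, axis=1)  # (num_positions, num_slots)
    slots = w.T @ values                         # (num_slots, n)
    return slots, w

def blend(recons, mask_logits):
    """Blend per-slot reconstructions with per-slot masks.

    recons:      (num_slots, H, W, C) decoded image per slot.
    mask_logits: (num_slots, H, W, 1) unnormalized mask per slot.
    Assumption: masks are normalized with a softmax across slots.
    """
    masks = softmax(mask_logits, axis=0)
    image = (masks * recons).sum(axis=0)         # (H, W, C)
    return image, masks

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
num_positions, num_slots, n = 16, 4, 8
keys = rng.normal(size=(num_positions, n))
values = rng.normal(size=(num_positions, n))
queries = rng.normal(size=(num_slots, n))
slots, w = simplified_slot_attention(keys, values, queries)

recons = rng.normal(size=(num_slots, 6, 6, 3))
mask_logits = rng.normal(size=(num_slots, 6, 6, 1))
image, masks = blend(recons, mask_logits)
```

Because the attention is applied once rather than refined over several iterations, the cost of this step is a single pair of matrix products per image.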
Abstract
Humans excel at abstracting data and constructing \emph{reusable} concepts, a capability lacking in current continual learning systems. The field of object-centric learning addresses this by developing abstract representations, or slots, from data without human supervision. Different methods have been proposed to tackle this task for images, but most are overly complex, non-differentiable, or poorly scalable. In this paper, we introduce a conceptually simple, fully differentiable, non-iterative, and scalable method called SAMP (Simplified Slot Attention with Max Pool Priors). It is implementable using only Convolution and MaxPool layers and an Attention layer. Our method encodes the input image with a Convolutional Neural Network and then uses a branch of alternating Convolution and MaxPool layers to create specialized sub-networks and extract primitive slots. These primitive slots are then used as queries for a Simplified Slot Attention over the encoded image. Despite its simplicity, our method is competitive with or outperforms previous methods on standard benchmarks.
