Slot State Space Models

Jindong Jiang; Fei Deng; Gautam Singh; Minseung Lee; Sungjin Ahn

TL;DR

SlotSSMs are a modular state-space framework that replaces a single monolithic state with multiple slots. Each slot follows its own state transition, and slots communicate only sparsely through a self-attention bottleneck. The architecture consists of a slot encoder, per-slot SSM updates, and a slot mixer, and the number of slots can vary across layers to capture different levels of abstraction. OC-SlotSSMs additionally use inverted attention to encourage object-centric decomposition, and their training pipeline supports unsupervised object segmentation and attribute prediction. Across multi-object video prediction, long-context reasoning, unsupervised object-centric learning, and 3D visual reasoning, the authors report substantial accuracy and efficiency gains for SlotSSMs and OC-SlotSSMs, with pretraining providing additional benefits on complex visual tasks.
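The inverted attention mentioned above can be illustrated with a minimal sketch. It assumes the Slot Attention-style formulation, in which the softmax is taken over the slot (query) axis so that slots compete for each input feature, followed by a renormalization over inputs. The function name `inverted_attention`, the absence of learned projections, and the `eps` constant are illustrative assumptions, not the paper's exact layer.

```python
import math

def inverted_attention(queries, keys, values, eps=1e-8):
    """Hypothetical sketch: queries are K slots x D, keys/values are N inputs x D."""
    num_slots, num_inputs = len(queries), len(keys)
    scale = 1.0 / math.sqrt(len(queries[0]))
    logits = [[scale * sum(q_d * k_d for q_d, k_d in zip(q, k)) for k in keys]
              for q in queries]

    # Softmax over the SLOT axis: for each input, weights across slots sum to 1.
    attn = [[0.0] * num_inputs for _ in range(num_slots)]
    for n in range(num_inputs):
        column = [logits[s][n] for s in range(num_slots)]
        peak = max(column)
        exps = [math.exp(x - peak) for x in column]
        total = sum(exps)
        for s in range(num_slots):
            attn[s][n] = exps[s] / total

    # Renormalize each slot's weights over inputs, giving a weighted mean of values.
    updates = []
    for s in range(num_slots):
        row_sum = sum(attn[s]) + eps
        weights = [w / row_sum for w in attn[s]]
        updates.append([sum(weights[n] * values[n][d] for n in range(num_inputs))
                        for d in range(len(values[0]))])
    return attn, updates

# Two slots competing for two input features.
attn, updates = inverted_attention(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    keys=[[4.0, 0.0], [0.0, 4.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the normalization runs across slots, an input feature claimed strongly by one slot contributes little to the others, which is the competition that is assumed here to drive object-centric decomposition.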

Abstract

Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown remarkable computational benefits in long-range temporal dependency modeling. However, in many sequence modeling problems, the underlying process is inherently modular, and it is of interest to have inductive biases that mimic this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into SSMs to preserve or encourage separation of information. Unlike conventional SSMs that maintain a monolithic state vector, SlotSSMs maintain the state as a collection of multiple vectors called slots. Crucially, the state transitions are performed independently per slot, with sparse interactions across slots implemented via the bottleneck of self-attention. In experiments, we evaluate our model in object-centric learning, 3D visual reasoning, and long-context video understanding tasks, which involve modeling multiple objects and their long-range temporal dependencies. We find that our proposed design offers substantial performance gains over existing sequence modeling methods. The project page is available at https://slotssms.github.io/
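The two mechanisms described in the abstract, independent per-slot state transitions and sparse cross-slot interaction through self-attention, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the transition is a fixed diagonal linear recurrence (the actual model may use input-dependent parameters), the mixer uses identity query/key/value projections, and the names `slot_ssm_step` and `slot_mixer` are hypothetical.

```python
import math

def slot_ssm_step(states, inputs, a, b):
    """One recurrent step where slot k uses only its own coefficients a[k], b[k].

    states, inputs, a, b: lists of K slots, each a list of D floats.
    No term couples slot k with another slot, so slots evolve independently.
    """
    return [
        [a[k][d] * states[k][d] + b[k][d] * inputs[k][d]
         for d in range(len(states[k]))]
        for k in range(len(states))
    ]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def slot_mixer(slots):
    """Self-attention across slots (identity projections, an assumption) plus a residual."""
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for q in slots:
        scores = [scale * sum(q_d * k_d for q_d, k_d in zip(q, k)) for k in slots]
        weights = softmax(scores)
        attended = [sum(weights[j] * slots[j][d] for j in range(len(slots)))
                    for d in range(dim)]
        mixed.append([q[d] + attended[d] for d in range(dim)])
    return mixed

# Two slots with different dynamics: slot 0 integrates its input, slot 1 only decays.
states = [[1.0, 2.0], [3.0, 4.0]]
inputs = [[1.0, 1.0], [1.0, 1.0]]
a = [[0.5, 0.5], [0.9, 0.9]]
b = [[1.0, 1.0], [0.0, 0.0]]
new_states = slot_ssm_step(states, inputs, a, b)
mixed_states = slot_mixer(new_states)
```

In this sketch, changing the input to one slot leaves every other slot's updated state untouched; information crosses slots only inside `slot_mixer`.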

Paper Structure (30 sections, 21 equations, 13 figures, 5 tables)

Introduction
Preliminaries
Slot State Space Models (SlotSSMs)
Modular Sequence Modeling with SlotSSM
Slot Encoder
Slot Mixer
Sequence Modeling Architecture
Object-Centric Learning with SlotSSM
Object-Centric SlotSSMs (OC-SlotSSMs)
Training Pipeline
Related Work
Experiments
Multi-Object Video Prediction
Long-Context Reasoning
Unsupervised Object-Centric Learning
...and 15 more sections

Figures (13)

Figure 1: SlotSSMs vs existing models. (a) SlotSSMs incorporate modularity through independent state transitions and sparse interactions via self-attention. (b) Traditional SSMs utilize a monolithic state vector for all past information. (c) Multi-slot Transformer-based models offer modularity but with high computational complexity. (d) Multi-slot RNN-based models have modular states but can't parallelize training (red lock). SlotSSMs combine parallelizable training, memory efficiency, and modularity for efficient temporal modeling.
Figure 2: SSM vs SlotSSM. SlotSSM encourages modularity by maintaining a set of separate slot state representations, each updated independently using separate transition matrices and input matrices, allowing for more efficient and scalable modeling of complex sequences with inherent modular structures.
Figure 3: Sequence modeling with SlotSSM. Each layer includes a Slot Encoder, SlotSSM, and Slot Mixer. The Slot Encoder uses a Transformer to extract slots from inputs. The SlotSSM independently updates the slots via separate state transitions. The Slot Mixer introduces inter-slot interactions through self-attention.
Figure 4: Multi-Object Video Prediction Task. Left: Generated video frames at every second step, showing 10 of the 20 total frames generated. Green color indicates ground-truth and red color indicates predictions. Right: MSE over a 20-frame autoregressive rollout, given 10 context frames. SlotSSM demonstrates its efficiency in modeling multi-object dynamics.
Figure 5: Long-Context Construction and Model Efficiency in the Blinking Color Balls Benchmark. Left: We construct long-sequence inputs by patchifying the context images. Right: Comparison of model inference latency with batch size 6. SlotSSM demonstrates computational efficiency for long-sequence processing tasks.
...and 8 more figures
