Predicting Video Slot Attention Queries from Random Slot-Feature Pairs

Rongzhen Zhao, Jian Li, Juho Kannala, Joni Pajarinen

TL;DR

RandSF.Q tackles two core limitations of unsupervised video OCL: underutilization of next-frame features for query prediction and insufficient learning of transition dynamics. It introduces a transitioner that conditions on both current slots and next-frame features, and a training scheme that samples random slot–feature pairs to learn transitions. Across MOVi-C, MOVi-D, YTVIS, and CLEVRER, RandSF.Q achieves state-of-the-art object discovery and notable improvements in object recognition and visual question answering, demonstrating stronger, more informative object-centric scene representations. The approach remains end-to-end trainable with manageable computational overhead and offers a practical advance for downstream scene understanding tasks.

Abstract

Unsupervised video Object-Centric Learning (OCL) is promising because it enables object-level scene representation and understanding, as humans do. Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transits the current slots to queries for the next frame. This architecture is effective, but all existing implementations both (i1) neglect to incorporate next-frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) we design a new transitioner that incorporates both slots and features, which provides more information for query prediction; (t2) we train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state of the art. This superiority also benefits downstream tasks such as scene understanding. Source Code, Model Checkpoints, Training Logs: https://github.com/Genera1Z/RandSF.Q

Paper Structure

This paper contains 15 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: (a) Mainstream video OCL adopts a recurrent architecture, where a transitioner transits the current slots into the query for the next video frame, and an aggregator aggregates the next frame feature into slots under that query. (b) Our intuitive observation: when the transitioner predicts the next query, the next frame feature is already available and very informative, so it should also be utilized. (c) Our empirical observation: when the transitioner is removed and the current slots are used directly as the next query, the algorithm works even better, which indicates that the existing transitioner is not effectively learned.
  • Figure 2: Our model architecture. (left) Our method is built upon SlotContrast [manasyan2025slotcontrast]. A frozen DINO2 [oquab2023dino2] model encodes the current video frame $\boldsymbol{I}_t$ into the current feature $\boldsymbol{F}_t$; a Slot Attention [locatello2020slotattent] module aggregates $\boldsymbol{F}_t$ into current object-level vectors, slots $\boldsymbol{S}_t$, under the current query $\boldsymbol{Q}_t$; a Transformer decoder block [vaswani2017transformer] transits $\boldsymbol{S}_t$, conditioned on the next feature $\boldsymbol{F}_{t+1}$, to the next query $\boldsymbol{Q}_{t+1}$; a random Transformer decoder [zhao2025dias] decodes $\boldsymbol{S}_t$ into the current reconstruction $\boldsymbol{F}'_t$. The objective is to minimize the difference between $\boldsymbol{F}_t$ and $\boldsymbol{F}'_t$. (right) How our transitioner works. To effectively learn transition dynamics, during training our transitioner explores slots $\boldsymbol{S}_{t_1}$ and feature $\boldsymbol{F}_{t_2}$ at any past time point within window size $\mathit{\Delta}$ to predict the next query $\boldsymbol{Q}_{t+1}$. Relative time embeddings are added to $\boldsymbol{S}_{t_1}$ and $\boldsymbol{F}_{t_2}$ to indicate their offsets from $t+1$. To maximize prediction accuracy, during evaluation our transitioner exploits only the most recent slots $\boldsymbol{S}_t$ and feature $\boldsymbol{F}_{t+1}$ to predict $\boldsymbol{Q}_{t+1}$. The left sub-figure is adapted from [zhao2025vvo].
  • Figure 3: Qualitative results of our RandSF.Q on YTVIS, compared with SotA SlotContrast.
  • Figure 4: RandSF.Q performance with queries predicted from slot-feature pairs at different relative time steps.
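
The random slot-feature pairing described in the Figure 2 caption can be illustrated with a minimal sketch. It only shows how time indices $t_1$, $t_2$ and their offsets from $t+1$ might be chosen; it is not the authors' implementation. The function name `sample_slot_feature_pair`, the uniform sampling, and the exact index ranges (slot offset in $[1, \mathit{\Delta}]$, feature offset in $[0, \mathit{\Delta}-1]$, clipped at frame 0) are assumptions made for illustration, since the source does not specify them.

```python
import random


def sample_slot_feature_pair(t, window, training, rng=random):
    """Pick time indices (t1, t2) of the slots S_{t1} and feature F_{t2}
    used to predict the next query Q_{t+1}.

    Hypothetical helper: the index ranges and uniform sampling below are
    assumptions, not taken from the paper.
    Returns (t1, t2, d1, d2), where d1 and d2 are the offsets from t+1
    that the relative time embeddings would encode.
    """
    target = t + 1
    if not training:
        # Evaluation: use only the most recent slots and the next feature.
        t1, t2 = t, target
    else:
        # Training: sample any available time step inside the window.
        # Slots exist up to step t; features exist up to step t+1.
        t1 = rng.randint(max(0, target - window), t)
        t2 = rng.randint(max(0, target - window + 1), target)
    return t1, t2, target - t1, target - t2


# Evaluation always pairs S_t with F_{t+1}.
print(sample_slot_feature_pair(5, 3, training=False))  # (5, 6, 1, 0)
```

Under these assumptions, the evaluation case is simply the pair with the smallest offsets, so the transitioner sees that configuration during training as well.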