Smoothing Slot Attention Iterations and Recurrences

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

TL;DR

SmoothSA tackles two core limitations of Slot Attention: the lack of sample-specific cues in cold-start queries on the first frame and the inappropriate uniform transforms used across video frame recurrences. It introduces a preheating module that self-distills informative queries for the first frame and differentiates SA transforms to handle the first and non-first frames with depths of $3$ and $1$ iterations, respectively. Empirical results across object discovery, recognition, and VQA show state-of-the-art performance on image and video OCL benchmarks, along with consistent downstream gains. These advances yield more accurate object-centric representations and more efficient video processing, with analysis clarifying how preheating stabilizes iterations and transform differentiation stabilizes recurrences.

Abstract

Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors by \textit{iteratively} refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is \textit{recurrently} shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues and thus hinder precise aggregation on the image or the video's first frame; also, non-first frames' queries are already sample-specific and thus require transforms different from the first frame's aggregation. We address these issues for the first time with our \textit{SmoothSA}: (1) to smooth SA iterations on the image or the video's first frame, we \textit{preheat} the cold-start queries with rich information from the input features, via a tiny module self-distilled inside OCL; (2) to smooth SA recurrences across all video frames, we \textit{differentiate} the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our source code, model checkpoints and training logs are available at https://github.com/Genera1Z/SmoothSA.
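
The iteration and recurrence schedule described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the SA step below is deliberately simplified (no learned projections, GRU or MLP), the slot-to-query transition between frames is assumed to be the identity, and the names `sa_step`, `aggregate` and `video_slots` are hypothetical. The only part taken from the text is the schedule itself: full (three) iterations on the first frame and a single iteration on each non-first frame.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sa_step(queries, feats):
    """One simplified SA iteration (assumption: no learned weights, GRU or MLP)."""
    d = queries.shape[-1]
    logits = queries @ feats.T / np.sqrt(d)  # (num_slots, num_feats)
    attn = softmax(logits, axis=0)           # slots compete for each feature
    weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)  # normalize per slot
    return weights @ feats                   # weighted mean of the features


def aggregate(queries, feats, num_iters):
    """Refine the queries into slots by repeating the SA step num_iters times."""
    slots = queries
    for _ in range(num_iters):
        slots = sa_step(slots, feats)
    return slots


def video_slots(frames, cold_queries, first_iters=3, later_iters=1):
    """Recurrent aggregation: full iterations on frame 0, one iteration afterwards."""
    slots_per_frame, schedule = [], []
    queries = cold_queries
    for t, feats in enumerate(frames):
        n = first_iters if t == 0 else later_iters
        slots = aggregate(queries, feats, n)
        slots_per_frame.append(slots)
        schedule.append(n)
        queries = slots  # assumption: identity transition to the next frame's queries
    return slots_per_frame, schedule


rng = np.random.default_rng(0)
frames = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 frames, 16 features of dim 8
cold_queries = rng.normal(size=(3, 8))                 # 3 cold-start queries
slots_per_frame, schedule = video_slots(frames, cold_queries)
print(schedule)  # [3, 1, 1, 1]
```

Setting `later_iters=3` in this sketch recovers the homogeneous setting that the abstract argues against, which makes it easy to compare the two schedules side by side.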

Paper Structure

This paper contains 14 sections, 15 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Image Object-Centric Learning (OCL) is essentially realized via Slot Attention (SA) iterations on the image (upper), while video OCL is realized via SA recurrences across the video's frames (whole). The query cold-start issue arises in SA iterations on the image or the video's first frame: the cold-start queries lack sample-specific cues and thus hinder precise aggregation. The transform homogeneity issue arises in SA recurrences across the video's first and non-first frames: non-first frames' queries are already sample-specific and thus require transforms different from the first frame's aggregation.
  • Figure 2: The overall model and where we modify it. (upper) In the OCL model for images, we preheat the cold-start queries to be informative so as to smooth SA iterations on the image (or the video's first frame). Our preheater is a tiny module trained to predict, from the cold-start queries and image features, vectors that approximate the slots; these vectors serve as the preheated queries. (upper + lower) In the OCL model for videos, we differentiate the homogeneous transforms to adapt to the different queries of the first and non-first frames so as to smooth SA recurrences across all frames. This is achieved by using the full three SA iterations on the first frame and a single SA iteration on non-first frames.
  • Figure 3: Qualitative results of our SmoothSA on images (left) and videos (right), compared with state-of-the-art SPOT and SlotContrast respectively.
  • Figure 4: Performance of models trained with (positive example) and without (negative example) query preheating as the number of SA iterations is reduced. The positive example's performance drops only slightly, while the negative example's performance degrades quickly.
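
The preheating idea in the Figure 2 caption (a tiny module that maps cold-start queries and image features to vectors approximating the slots) can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the single residual cross-attention layer, the weight matrices `w_q`, `w_k`, `w_v`, the mean-squared-error objective, and the function names `preheat` and `distill_loss` are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def preheat(cold_queries, feats, w_q, w_k, w_v):
    """Hypothetical preheater: one cross-attention read from the features, added residually."""
    q = cold_queries @ w_q                                    # (num_slots, d)
    k = feats @ w_k                                           # (num_feats, d)
    v = feats @ w_v                                           # (num_feats, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=1)    # each query attends over features
    return cold_queries + attn @ v                            # sample-specific queries


def distill_loss(preheated, slots):
    """Assumed objective: MSE between preheated queries and slots used as a fixed target."""
    return float(np.mean((preheated - slots) ** 2))


rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))          # image features
cold_queries = rng.normal(size=(3, 8))    # sample-agnostic cold-start queries
w_q, w_k, w_v = (0.1 * rng.normal(size=(8, 8)) for _ in range(3))

preheated = preheat(cold_queries, feats, w_q, w_k, w_v)
target_slots = rng.normal(size=(3, 8))    # stand-in for slots produced by SA on the same input
loss = distill_loss(preheated, target_slots)
```

In an actual training loop the target slots would come from the model's own SA output on the same input, with gradients blocked through the target, which is one plausible reading of "self-distilled inside OCL".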