Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen

TL;DR

The paper tackles two core limitations of Object-Centric Learning with Slot Attention: slot redundancy and limited internal supervision. It proposes DIAS, which combines re-initialized aggregation to refresh the remaining slots after redundancy reduction, self-distillation to align first- and last-iteration aggregation attention without an external teacher, and generalized random auto-regressive decoding to better capture spatial structure. The approach yields state-of-the-art results in object discovery and recognition and improves downstream visual prediction and reasoning, while maintaining training efficiency relative to offline distillation methods. These advancements enable more robust and scalable object-centric representations for both images and videos, with open-source code and checkpoints provided.

Abstract

Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): (i) we reduce redundancy in the aggregated slots and re-initialize an extra aggregation to update the remaining slots; (ii) we drive the poor attention map at the first aggregation iteration to approximate the good one at the last iteration, enabling self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art performance on OCL tasks such as object discovery and recognition, while also improving advanced visual prediction and reasoning. Our source code and model checkpoints are available at https://github.com/Genera1Z/DIAS.
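
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the aggregation omits the learned projections, GRU, and MLP of real Slot Attention; the cosine-similarity rule in `reduce_redundancy`, its `threshold`, and the KL form of `self_distill_loss` are assumptions chosen only to make the data flow concrete. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(slots, feats, n_iters=3):
    """Simplified Slot Attention (assumption: no learned projections, GRU, or MLP).

    slots: (K, D) queries, feats: (N, D) superpixel features.
    Returns the updated slots and the (K, N) attention maps of the
    first and last iteration.
    """
    d = feats.shape[1]
    attns = []
    for _ in range(n_iters):
        logits = slots @ feats.T / np.sqrt(d)       # (K, N)
        attn = softmax(logits, axis=0)              # slots compete for each superpixel
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ feats                     # weighted mean of features per slot
        attns.append(attn)
    return slots, attns[0], attns[-1]

def reduce_redundancy(slots, threshold=0.95):
    """Hypothetical rule: drop a slot whose cosine similarity to an
    already kept slot exceeds `threshold`."""
    unit = slots / (np.linalg.norm(slots, axis=1, keepdims=True) + 1e-8)
    kept = []
    for i in range(len(slots)):
        if all(unit[i] @ unit[j] <= threshold for j in kept):
            kept.append(i)
    return slots[kept]

def self_distill_loss(attn_first, attn_last, eps=1e-8):
    """Assumed loss: KL(last || first), averaged over superpixels.
    The last-iteration map plays the role of a fixed target."""
    kl = attn_last * (np.log(attn_last + eps) - np.log(attn_first + eps))
    return kl.sum(axis=0).mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 16))   # 64 superpixels, 16 channels
init = rng.normal(size=(5, 16))     # 5 initial slots
init[4] = init[3]                   # duplicate a slot to simulate redundancy

slots, attn_first, attn_last = aggregate(init, feats)
remaining = reduce_redundancy(slots)
# Re-initialized aggregation: the remaining slots query the features once more.
refined, _, _ = aggregate(remaining, feats, n_iters=1)
loss = self_distill_loss(attn_first, attn_last)
```

In a trainable model the last-iteration attention would typically be detached from the gradient graph so that only the first iteration is pushed toward it; NumPy has no gradients, so that detail appears here only as a comment on the intent.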

Paper Structure

This paper contains 13 sections, 23 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Our method is inspired by the following two key observations. (o1) We re-initialize an extra aggregation to update the remaining slots after slot redundancy reduction, instead of decoding the remaining slots directly. (o2) We self-distill the attention map at the first aggregation iteration to approximate the almost-always-good attention map at the last aggregation iteration, rather than distilling from aggregation attention to decoding attention.
  • Figure 2: Our DIAS introduces three novel designs: (i) Re-initialized aggregation, which improves the slots' object representation quality by re-initializing an extra aggregation to update the remaining slots after slot redundancy reduction; (ii) Self-distilled aggregation, which obtains better internal supervision by driving the attention map at the first aggregation iteration to approximate the almost-always-better attention map at the last aggregation iteration; and (iii) Random auto-regressive decoding, which enforces the decoder's modeling of spatial correlations by randomly flattening a 2-dimensional feature map into a 1-dimensional sequence. On the top left is the common OCL architecture, which is adapted from zhao2025vvo. $\bm{\phi}_\mathrm{a}$ is the OCL aggregator, $\bm{\phi}_\mathrm{d}$ is the OCL decoder, and $\bm{\phi}_\mathrm{r}$ is the slot redundancy reduction. m is the mask token, bos is the Begin-of-Sentence token, and pos emb stands for the position embedding tensor. All notations are defined in the marked sections.
  • Figure 3: Visualization of object discovery.
  • Figure 4: Visual prediction (upper) and reasoning (lower).
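
The random flattening mentioned in the Figure 2 caption can be sketched as follows. This is only an illustration of the idea, assuming the flattening is a uniformly random permutation of the spatial positions; the function names `random_flatten` and `unflatten` are hypothetical and do not come from the DIAS code base.

```python
import numpy as np

def random_flatten(feat_map, rng):
    """Flatten an (H, W, D) feature map into an (H*W, D) sequence
    whose token order is a random permutation of the spatial positions."""
    h, w, d = feat_map.shape
    order = rng.permutation(h * w)             # random visiting order of positions
    seq = feat_map.reshape(h * w, d)[order]    # token t comes from position order[t]
    return seq, order

def unflatten(seq, order, h, w):
    """Invert random_flatten: put every token back at its spatial position."""
    out = np.empty_like(seq)
    out[order] = seq
    return out.reshape(h, w, seq.shape[1])

rng = np.random.default_rng(0)
feat_map = rng.normal(size=(4, 4, 8))          # toy 4x4 map with 8 channels
seq, order = random_flatten(feat_map, rng)
restored = unflatten(seq, order, 4, 4)
```

Under this assumption, a position embedding tensor would be indexed with the same `order`, so that an auto-regressive decoder still knows which spatial location each token in the shuffled sequence belongs to.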