SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers

Ioannis Kakogeorgiou, Spyros Gidaris, Konstantinos Karantzalos, Nikos Komodakis

TL;DR

SPOT tackles unsupervised object-centric learning by combining two techniques: self-training that distills the decoder's attention masks into the encoder's slot-attention masks to improve object-specific slot generation, and a sequence-permutation strategy for autoregressive transformers that keeps slot information influential during reconstruction. The approach uses a two-stage training framework with a distillation loss and multiple patch-order permutations, and it achieves state-of-the-art or competitive results on real-world datasets such as COCO while maintaining training stability. Key findings include substantial gains in mBO$^i$, mBO$^c$, and mIoU across datasets, concerns about the reliability of FG-ARI as a metric, and evidence that permutations improve the supervisory signal for slot learning in AR decoders. The work suggests that sequence permutations may apply to other CV tasks that employ autoregressive decoders, and it provides a practical path toward more robust object-centric representations in complex scenes.

Abstract

Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques, (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially with complex real-world images. We provide the implementation code at https://github.com/gkakogeorgiou/spot .
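To make the patch-order permutation idea in (ii) concrete, here is a minimal sketch of how one image's patch sequence could be reordered before autoregressive decoding. It is only an illustration under stated assumptions: the function names, the column-major ordering, and the `<BOS>` placeholder are hypothetical and not taken from the authors' code, which may use different orderings and token handling.

```python
def raster_order(h, w):
    """Default order: patches visited row by row, left to right."""
    return list(range(h * w))

def column_major_order(h, w):
    """Alternative order (assumed for illustration): column by column, top to bottom."""
    return [r * w + c for c in range(w) for r in range(h)]

def permute(seq, order):
    """Reorder a patch sequence so that position k holds seq[order[k]]."""
    return [seq[i] for i in order]

def invert(order):
    """Return the permutation that undoes `order`."""
    inv = [0] * len(order)
    for pos, idx in enumerate(order):
        inv[idx] = pos
    return inv

def ar_inputs_and_targets(patches, order, bos="<BOS>"):
    """Build teacher-forced inputs and targets for one permutation.

    The decoder predicts the patch at step k from the patches at steps < k,
    all taken in the permuted order; `bos` stands in for a start token.
    """
    targets = permute(patches, order)
    inputs = [bos] + targets[:-1]
    return inputs, targets

# Toy 2x3 patch grid, patches labelled in raster order.
patches = ["p0", "p1", "p2", "p3", "p4", "p5"]
order = column_major_order(2, 3)
inputs, targets = ar_inputs_and_targets(patches, order)
# Predictions made in permuted order can be mapped back to raster order.
restored = permute(targets, invert(order))
```

Changing the order changes which patches the decoder has already seen when it predicts a given patch. That is the property the permutation strategy relies on: the early patches of each ordering have little preceding context and must lean on the slots.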

Paper Structure (28 sections, 7 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Enhancing unsupervised object-centric learning via self-training. Our two-stage approach starts with exclusive training in the initial stage (not depicted) using the reconstruction loss $L_{\mathrm{REC}}$. In the following stage, shown here, a teacher-student framework is applied. The teacher model, trained in the first stage, guides the student model with an additional loss $L_{\mathrm{ATT}}$, distilling attention masks $A_{\mathrm{DEC}}$ from the teacher's decoder to the slot-attention masks $A_{\mathrm{SLOT}}$ in the student's encoder.
  • Figure 2: Sequence permutations in SPOT. The sequence of patches used for autoregressive-based decoder predictions.
  • Figure 3: $L_1$ gradient norms of each patch's reconstruction loss with respect to the decoder's input slots (aggregated across all slots, four decoder blocks, and the entire COCO validation set). Subplots show gradients with: (a) the default permutation and (b) randomly sampled sequence permutations.
  • Figure 4: Autoregressive (AR) decoding via sequence permutations. Violet boxes indicate differences from a typical AR decoder.
  • Figure 5: Example results on COCO 2017, using 7 slots.
  • ...and 3 more figures
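
The teacher-student setup in the Figure 1 caption can be sketched as a per-patch loss between the teacher's decoder attention masks $A_{\mathrm{DEC}}$ and the student's slot-attention masks $A_{\mathrm{SLOT}}$. The sketch below is a simplified illustration, not the authors' implementation. It assumes the teacher masks are turned into hard targets by an argmax over slots, that $L_{\mathrm{ATT}}$ is a cross-entropy, and that teacher and student slots are already aligned; none of these details is stated on this page. The names `hard_targets` and `attention_distillation_loss` are hypothetical.

```python
import math

def hard_targets(a_dec):
    """For each patch, pick the slot with the largest teacher decoder attention.

    a_dec: list of per-patch rows, each row a distribution over slots.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in a_dec]

def attention_distillation_loss(a_slot, a_dec, eps=1e-8):
    """Mean cross-entropy between student slot-attention masks and hard teacher targets.

    Assumption: slot k of the student corresponds to slot k of the teacher.
    """
    targets = hard_targets(a_dec)
    losses = [-math.log(row[t] + eps) for row, t in zip(a_slot, targets)]
    return sum(losses) / len(losses)

# Toy example: 3 patches, 2 slots.
teacher = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
student_good = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]  # agrees with the teacher
student_bad = [[0.2, 0.8], [0.9, 0.1], [0.3, 0.7]]   # disagrees with the teacher

loss_good = attention_distillation_loss(student_good, teacher)
loss_bad = attention_distillation_loss(student_bad, teacher)
```

In the second training stage described in the caption, a term of this kind would be added to the reconstruction loss $L_{\mathrm{REC}}$. The weighting between the two terms is not given here and is left out of the sketch.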