SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers
Ioannis Kakogeorgiou, Spyros Gidaris, Konstantinos Karantzalos, Nikos Komodakis
TL;DR
SPOT tackles unsupervised object-centric learning by combining two innovations: self-training, which distills slot-based attention masks from the decoder back into the encoder to improve object-specific slot generation, and a sequence-permutation strategy for autoregressive transformers that forces slot information to remain influential during reconstruction. The approach uses a two-stage training framework with distillation losses and multiple patch-order permutations, and it achieves state-of-the-art or competitive results on real-world datasets such as COCO while maintaining training stability. Key findings include substantial gains in mBO$^i$, mBO$^c$, and mIoU across datasets, concerns about the reliability of FG-ARI as a metric, and evidence that permutations improve the supervisory signal for slot learning in autoregressive (AR) decoders. The work suggests that sequence permutations may apply to other computer vision tasks that use autoregressive decoders, and it offers a practical path toward more robust object-centric representations in complex scenes.
Abstract
Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques: (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based auto-encoder methods in unsupervised object segmentation, especially with complex real-world images. We provide the implementation code at https://github.com/gkakogeorgiou/spot.
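
To make the patch-order permutation idea concrete, the following is a minimal sketch of how the patch tokens of an image grid could be reordered before being fed to an autoregressive decoder, and how the original order can be restored afterwards. The abstract does not state which orderings are used, so the four orderings here, as well as the helper names `patch_orders`, `permute`, and `inverse`, are assumptions chosen for illustration and are not taken from the released code.

```python
from typing import Dict, List


def patch_orders(height: int, width: int) -> Dict[str, List[int]]:
    """Return several orderings of the patch indices of a height x width grid.

    Index r * width + c refers to the patch in row r, column c.
    The specific orderings are illustrative assumptions, not the paper's set.
    """
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return {
        "row_major": row_major,
        "row_major_reversed": row_major[::-1],
        "col_major": col_major,
        "col_major_reversed": col_major[::-1],
    }


def permute(tokens: List, order: List[int]) -> List:
    """Reorder patch tokens so that position i holds tokens[order[i]]."""
    return [tokens[i] for i in order]


def inverse(order: List[int]) -> List[int]:
    """Return the permutation that undoes `order`."""
    inv = [0] * len(order)
    for position, index in enumerate(order):
        inv[index] = position
    return inv


if __name__ == "__main__":
    orders = patch_orders(2, 3)
    tokens = ["p0", "p1", "p2", "p3", "p4", "p5"]
    shuffled = permute(tokens, orders["col_major"])
    print(shuffled)  # ['p0', 'p3', 'p1', 'p4', 'p2', 'p5']
    restored = permute(shuffled, inverse(orders["col_major"]))
    print(restored == tokens)  # True
```

In a training loop, one such ordering would plausibly be sampled per step, so that the decoder cannot rely on a single fixed set of preceding patches and has to draw more on the slot vectors. That sampling scheme is likewise an assumption here.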
