When Slots Compete: Slot Merging in Object-Centric Learning

Christos Chatzisavvas, Panagiotis Rigas, George Ioannakis, Vassilis Katsouros, Nikolaos Mitianoudis

Abstract

Slot-based object-centric learning represents an image as a set of latent slots with a decoder that combines them into an image or features. The decoder specifies how slots are combined into an output, but the slot set is typically fixed: the number of slots is chosen upfront and slots are only refined. This can lead to multiple slots competing for overlapping regions of the same entity rather than focusing on distinct regions. We introduce slot merging: a drop-in, lightweight operation on the slot set that merges overlapping slots during training. We quantify overlap with a Soft-IoU score between slot-attention maps and combine selected pairs via a barycentric update that preserves gradient flow. Merging follows a fixed policy, with the decision threshold inferred from overlap statistics, requiring no additional learnable modules. Integrated into the established feature-reconstruction pipeline of DINOSAUR, the proposed method improves object factorization and mask quality, surpassing other adaptive methods in object discovery and segmentation benchmarks.

Paper Structure (21 sections, 9 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: We introduce a merge operator over the slot set that adaptively refines factorization, producing coherent object-level representations.
  • Figure 2: Illustration of the proposed pipeline.
  • Figure 3: Overview of the proposed training pipeline and merging mechanism. After Slot Attention produces refined slots and attention maps, pairwise Attention-IoU scores are computed to identify the most overlapping slot pair. The selected slots are merged via barycentric aggregation and their attention maps are summed. The procedure is repeated iteratively until no overlap exceeds the threshold, yielding a refined slot set passed to the decoder.
  • Figure 4: Evolution of slot representations during training (left to right). Slot assignments transition from fragmented and unstable to progressively specialized and object-aligned. After stabilization (right), slot merging is enabled to resolve overlap between slots.
  • Figure 5: Visualizations of the iterative slot merging process. Starting from slot representations produced by Slot Attention, the algorithm repeatedly merges the pair of slots with the highest Soft-IoU overlap. The final merged slot set is then passed to the decoder for reconstruction.
  • ...and 1 more figure
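
The merging loop described in the captions for Figures 3 and 5 can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. The abstract and captions do not give the formulas, so two choices here are assumptions: the Soft-IoU form (sum of elementwise minima over sum of elementwise maxima) and the barycentric weights (proportional to each slot's total attention mass). The names `soft_iou` and `merge_slots` are hypothetical. The paper infers the threshold from overlap statistics; here it is passed as a plain argument.

```python
from itertools import combinations


def soft_iou(a, b, eps=1e-8):
    """Soft IoU of two non-negative attention maps given as flat lists.

    ASSUMPTION: min/max form; the paper's exact definition may differ.
    """
    inter = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return inter / (union + eps)


def merge_slots(slots, attn, threshold):
    """Greedily merge the most-overlapping slot pair until none exceeds threshold.

    slots: list of K slot vectors (lists of floats)
    attn:  list of K attention maps (flat lists, one value per patch)
    """
    slots = [list(s) for s in slots]
    attn = [list(a) for a in attn]
    while len(slots) > 1:
        # Score every pair and pick the one with the highest overlap.
        score, i, j = max(
            (soft_iou(attn[i], attn[j]), i, j)
            for i, j in combinations(range(len(slots)), 2)
        )
        if score <= threshold:
            break
        # ASSUMPTION: barycentric weights proportional to attention mass.
        mi, mj = sum(attn[i]), sum(attn[j])
        wi = mi / (mi + mj) if (mi + mj) > 0 else 0.5
        merged_slot = [wi * x + (1.0 - wi) * y for x, y in zip(slots[i], slots[j])]
        # Attention maps of merged slots are summed (as in the Figure 3 caption).
        merged_attn = [x + y for x, y in zip(attn[i], attn[j])]
        # Keep the merged result at index i and drop index j (j > i always).
        slots[i], attn[i] = merged_slot, merged_attn
        del slots[j], attn[j]
    return slots, attn


# Toy example: slots 0 and 1 attend to the same two patches, slot 2 to the rest.
slots = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
attn = [
    [0.5, 0.4, 0.0, 0.0],
    [0.4, 0.5, 0.0, 0.0],
    [0.1, 0.1, 1.0, 1.0],
]
merged_slots, merged_attn = merge_slots(slots, attn, threshold=0.5)
print(len(merged_slots))
```

In this toy case only the first two slots overlap strongly enough to be merged, so two slots remain. The merged slot is a convex combination of its parents, so in an autodiff framework the same update would stay differentiable with respect to both. That is one plausible reading of "preserves gradient flow".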