Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

Ke Fan, Zechen Bai, Tianjun Xiao, Tong He, Max Horn, Yanwei Fu, Francesco Locatello, Zheng Zhang

TL;DR

The paper addresses the limitation of fixed slot counts in object-centric learning by introducing AdaSlot, which adaptively determines the number of slots per image instance. AdaSlot comprises a complexity-aware object auto-encoder with an upper bound $K_{max}$, a mean-field slot sampling module based on Gumbel-Softmax to produce a discrete mask $Z$, and a masked slot decoder for reconstruction. Across datasets such as CLEVR10, MOVi-C/E, and COCO, AdaSlot matches or surpasses fixed-slot baselines on object grouping while producing per-image slot counts that correlate with ground-truth object counts. The results demonstrate that dynamic slot adaptation is a viable path for scalable, unsupervised object discovery and slot attention research.
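
The sampling step can be illustrated with a minimal sketch, assuming each of the $K_{max}$ candidate slots carries its own pair of (drop, keep) logits and is sampled independently, which is how the mean-field description reads. The function names, the logit layout, and the temperature `tau` are hypothetical and not taken from the paper's code; only the forward pass is shown, without the gradient estimator used in training.

```python
import math
import random


def gumbel_noise(rng):
    # Gumbel(0, 1) sample via inverse transform; the clamp avoids log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sample_slot_mask(keep_logits, tau=1.0, seed=0):
    """Sample a keep/drop decision for every candidate slot.

    keep_logits: list of (drop_logit, keep_logit) pairs, one per slot
                 (assumed layout, for illustration only).
    Returns (soft_keep_probs, hard_mask); hard_mask plays the role of Z.
    """
    rng = random.Random(seed)
    soft, hard = [], []
    for drop_logit, keep_logit in keep_logits:
        # Perturb both logits with Gumbel noise and apply the temperature.
        a = (drop_logit + gumbel_noise(rng)) / tau
        b = (keep_logit + gumbel_noise(rng)) / tau
        # Numerically stable two-way softmax gives the relaxed sample.
        m = max(a, b)
        ea, eb = math.exp(a - m), math.exp(b - m)
        soft.append(eb / (ea + eb))
        # The discrete decision is the argmax of the perturbed logits.
        hard.append(1 if b >= a else 0)
    return soft, hard


if __name__ == "__main__":
    logits = [(-50.0, 50.0), (50.0, -50.0), (0.0, 0.0)]
    soft, hard = sample_slot_mask(logits, tau=1.0, seed=0)
    print("kept slots:", sum(hard), "of", len(hard))
```

The number of kept slots for an instance is simply the sum of the hard mask, so it can vary from image to image while never exceeding $K_{max}$.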

Abstract

Object-centric learning (OCL) extracts the representation of objects with slots, offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention, which utilizes attention mechanisms to iteratively refine slot representations. However, a major drawback of most object-centric models, including slot attention, is their reliance on predefining the number of slots. This not only necessitates prior knowledge of the dataset but also overlooks the inherent variability in the number of objects present in each instance. To overcome this fundamental limitation, we present a novel complexity-aware object auto-encoder framework. Within this framework, we introduce an adaptive slot attention (AdaSlot) mechanism that dynamically determines the optimal number of slots based on the content of the data. This is achieved by proposing a discrete slot sampling module that is responsible for selecting an appropriate number of slots from a candidate list. Furthermore, we introduce a masked slot decoder that suppresses unselected slots during the decoding process. Our framework, tested extensively on object discovery tasks with various datasets, shows performance matching or exceeding top fixed-slot models. Moreover, our analysis substantiates that our method exhibits the capability to dynamically adapt the slot number according to each instance's complexity, offering the potential for further exploration in slot attention research. Project will be available at https://kfan21.github.io/AdaSlot/
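
One plausible way to suppress unselected slots during decoding is sketched below. It assumes a mixture-style decoder in which every slot proposes a value and a non-negative mixing weight per position; the abstract does not spell out this form, so the function name `masked_mixture`, the renormalization, and the `eps` constant are illustrative assumptions rather than the authors' implementation.

```python
def masked_mixture(alphas, recons, mask, eps=1e-8):
    """Combine per-slot reconstructions at a single position.

    alphas: per-slot mixing weights (non-negative)
    recons: per-slot reconstructed values
    mask:   per-slot keep flags (1 = kept, 0 = dropped)
    Returns (renormalized_weights, mixed_value).
    """
    # Zero out the weights of dropped slots.
    weighted = [z * a for z, a in zip(mask, alphas)]
    # Renormalize so that the kept slots share the full weight.
    total = sum(weighted) + eps
    weights = [w / total for w in weighted]
    value = sum(w * r for w, r in zip(weights, recons))
    return weights, value


if __name__ == "__main__":
    weights, value = masked_mixture(
        alphas=[0.5, 0.3, 0.2], recons=[1.0, 2.0, 3.0], mask=[1, 0, 1]
    )
    print(weights, value)
```

In this sketch a dropped slot contributes nothing to the output, and the remaining weights are rescaled to sum to approximately one, so the reconstruction is explained by the kept slots only.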

Paper Structure

This paper contains 21 sections, 14 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: Illustration of a raw image and three kinds of segmentation masks under different slot numbers. Pixels with the same color are grouped into the same slot. The choice of slot number strongly affects the resulting segmentation.
  • Figure 2: Illustration of our pipeline.
  • Figure 3: Visualization of instance-level adaptive slot number selection. We compare our models and the fixed-slot DINOSAUR on three datasets. For each dataset, we select two examples and compare our model with a small slot number and a large slot number.
  • Figure 4: Visualization of instance-level adaptive slot number selection by per-slot segmentation, comparing the fixed 11-slot model (first row) and our model (second row). Dropped slots are left empty.
  • Figure 5: Stratified statistics of four metrics for our models and two fixed-slot models, one with the slot number set to the upper bound and the other with the slot number that gives both high ARI and mBO. Images are stratified by their ground-truth object count. The first row is MOVi-C and the second row is MOVi-E. The visualizations show that our model does not over-fit to a specific slot number to improve performance.
  • ...and 9 more figures