Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance

Huankun Sheng, Ming Li, Yixiang Wei, Yeying Fan, Yu-Hui Wen, Tieliang Gong, Yong-Jin Liu

TL;DR

This work tackles foreground-background interference in unsupervised object-centric learning by introducing Foreground-Aware Slot Attention (FASA), a two-stage framework that first disentangles foreground from background and then decomposes foreground into object slots. It combines clustering-based initialization, a masked slot attention mechanism that reserves the first slot for background, and pseudo-mask guidance via MaskCut (based on a patch-affinity graph) to reduce over-segmentation. The approach achieves state-of-the-art performance on COCO, PASCAL VOC, and MOVi-C, and demonstrates strong generalization to zero-shot object discovery and localization across synthetic and real-world datasets. The results highlight the value of explicit foreground modeling and pseudo-mask guidance for robust, object-coherent scene decomposition in unsupervised settings.

Abstract

Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.
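The pseudo-mask guidance above relies on a patch affinity graph built from self-supervised features. As a rough illustration only, the following sketch performs a single Normalized Cut bipartition on such a graph, the kind of step that MaskCut-style methods build on. The function name `ncut_bipartition`, the threshold `tau`, and the floor weight `eps` are assumptions made for this example and are not taken from the paper; the feature vectors are assumed to be non-zero.

```python
import numpy as np

def ncut_bipartition(features, tau=0.15, eps=1e-5):
    """Split patches into two groups with one Normalized Cut step.

    features: (N, D) array of non-zero patch features.
    tau, eps: assumed threshold and floor weight for binarized affinities.
    Returns a boolean array of shape (N,) marking one side of the cut.
    """
    # Cosine similarity between every pair of patches.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Binarize: strong edges get weight 1, weak edges a small positive
    # floor so that the graph stays connected.
    W = np.where(sim >= tau, 1.0, eps)
    d = W.sum(axis=1)
    # Solve (D - W) x = lambda * D x via the symmetric normalized Laplacian.
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)    # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]  # second-smallest eigenvector
    # Threshold at the mean to obtain the two partitions.
    return fiedler > fiedler.mean()
```

The sketch returns only the two-way split. Deciding which side is foreground, and repeating the cut to extract several objects, would need additional rules that are not shown here.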

Paper Structure

This paper contains 37 sections, 10 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Architecture of the Foreground-Aware Slot Attention (FASA). Our framework operates in two sequential stages. First, a two-slot attention module, trained with a feature reconstruction loss, generates a binary mask separating foreground from background regions. Conditioned on this mask, the second stage introduces a masked slot attention mechanism that binds input features to different slots: the first slot is dedicated to representing the background, while the remaining slots correspond to foreground objects. In addition, pseudo labels obtained from MaskCut (Wang et al., 2023) are used to guide the learning of the foreground slots.
  • Figure 2: Qualitative comparisons of object discovery. Existing methods tend to over-segment both background regions and foreground objects, as indicated by the dashed boxes. They also struggle to achieve precise instance-level decomposition, as highlighted by the solid boxes. In comparison, our method produces object masks that align closely with the ground-truth annotations.
  • Figure 3: Visual comparison of foreground–background separation. Green regions denote detected foreground; red regions indicate background. Our method achieves more accurate structural separation between foreground objects and background areas, generating masks that align closely with ground-truth annotations.
  • Figure 4: Visualization of results with and without pseudo-mask supervision. (a) Without pseudo-mask supervision; (b) With pseudo-mask supervision. Using pseudo-mask supervision can effectively reduce over-segmentation.
  • Figure S1: Visualization of zero-shot object discovery. Our approach exhibits strong generalization capabilities across both synthetic datasets and challenging real-world image datasets, maintaining robust performance even on data that significantly deviates from the training distribution.
  • ...and 3 more figures
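
The second stage described in the Figure 1 caption can be illustrated with one simplified attention update. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the binary foreground mask is applied by restricting slot 0 to background patches and the remaining slots to foreground patches, and it omits the learned projections, GRU, and MLP of a full slot attention module. The name `masked_slot_attention_step` and its arguments are hypothetical.

```python
import numpy as np

def masked_slot_attention_step(features, slots, fg_mask, eps=1e-8):
    """One simplified masked slot attention update.

    features: (N, D) patch features.
    slots:    (K, D) slot vectors, K >= 2; slot 0 is the background slot.
    fg_mask:  (N,) boolean, True where a patch is foreground.
    Returns (new_slots, attn) with attn of shape (K, N).
    """
    fg_mask = np.asarray(fg_mask, dtype=bool)
    n, d = features.shape
    k = slots.shape[0]
    if k < 2:
        raise ValueError("need one background slot and at least one foreground slot")
    logits = slots @ features.T / np.sqrt(d)  # (K, N) dot-product scores
    # allowed[i, j] is True when slot i may bind to patch j (assumed rule).
    allowed = np.empty((k, n), dtype=bool)
    allowed[0] = ~fg_mask
    allowed[1:] = fg_mask
    logits = np.where(allowed, logits, -np.inf)
    # Softmax over slots: permitted slots compete for each patch.
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)
    # Attention-weighted mean of the patches assigned to each slot.
    mass = attn.sum(axis=1, keepdims=True)
    pooled = (attn @ features) / np.maximum(mass, eps)
    # A slot that received no patches keeps its previous value.
    new_slots = np.where(mass > eps, pooled, slots)
    return new_slots, attn
```

With this masking rule, the background slot receives every background patch with weight one, so its update is simply the mean background feature, while the foreground slots share the foreground patches through the softmax competition.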