Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance
Huankun Sheng, Ming Li, Yixiang Wei, Yeying Fan, Yu-Hui Wen, Tieliang Gong, Yong-Jin Liu
TL;DR
This work tackles foreground-background interference in unsupervised object-centric learning by introducing Foreground-Aware Slot Attention (FASA), a two-stage framework that first separates foreground from background and then decomposes the foreground into object slots. It combines clustering-based slot initialization, a masked slot attention mechanism that reserves the first slot for the background, and pseudo-mask guidance from MaskCut (built on a patch-affinity graph) to reduce over-segmentation. The approach achieves state-of-the-art performance on COCO, PASCAL VOC, and MOVi-C, and generalizes well to zero-shot object discovery and localization across synthetic and real-world datasets. The results highlight the value of explicit foreground modeling and pseudo-mask guidance for robust, object-coherent scene decomposition in unsupervised settings.
Abstract
Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.
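To make the second-stage idea concrete, the following is a minimal NumPy sketch of one masked attention step in which slot 0 is reserved for the background and the remaining slots compete only over foreground patches. It is an illustration under stated assumptions, not the authors' implementation: the function name `masked_slot_attention_step`, the hard boolean foreground mask `fg_mask` (standing in for the stage-one output), and the omission of the learned projections, normalization layers, and recurrent update used in standard slot attention are all simplifications introduced here.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_slot_attention_step(features, slots, fg_mask):
    """One simplified attention step with a reserved background slot.

    features: (N, D) patch features.
    slots:    (K, D) slot vectors; slot 0 is the background slot (assumption).
    fg_mask:  (N,) boolean array, True where a patch is foreground
              (hypothetical stand-in for the stage-one decomposition).
    Returns the updated slots (K, D) and the attention map (K, N).
    """
    n, d = features.shape
    # Scaled dot-product logits between every slot and every patch.
    logits = slots @ features.T / np.sqrt(d)  # (K, N)

    # Hard masking (an assumption of this sketch): background patches may
    # only be claimed by slot 0, foreground patches only by slots 1..K-1.
    neg = -1e9
    logits = logits.copy()
    logits[1:, ~fg_mask] = neg
    logits[0, fg_mask] = neg

    # Slots compete for each patch: softmax over the slot axis.
    attn = softmax(logits, axis=0)  # (K, N)

    # Each slot becomes the attention-weighted mean of the patch features.
    weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
    updated = weights @ features  # (K, D)
    return updated, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 4))   # 6 patches, 4-dim features
    init = rng.normal(size=(3, 4))    # 1 background slot + 2 object slots
    mask = np.array([True, True, False, False, True, False])
    new_slots, attn = masked_slot_attention_step(feats, init, mask)
    print(attn.round(3))
```

In this sketch the background slot ends up as the mean of the background patch features, while the foreground slots split the foreground patches among themselves. A soft mask or a learned gating term would be an equally plausible reading of the abstract; the hard mask is chosen here only because it is the simplest to verify.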
