Guided Slot Attention for Unsupervised Video Object Segmentation
Minhyeok Lee, Suhwan Cho, Dogyoon Lee, Chaewon Park, Jungho Lee, Sangyoun Lee
TL;DR
This work tackles unsupervised video object segmentation in challenging scenes by introducing the Guided Slot Attention Network (GSA-Net). The model generates guided slots from the target frame, fuses local and global contextual cues through a novel Feature Aggregation Transformer (FAT), and refines slot representations with KNN-filtered attention over several iterations, ending in a cosine-similarity-based decoder for mask generation. Empirical results on DAVIS-16 and FBMS demonstrate state-of-the-art performance and robustness, and ablations confirm the contributions of guided slots, KNN filtering, and FAT. The approach offers a scalable, data-efficient means to improve foreground-background separation without explicit supervision, with practical implications for video understanding tasks.
Abstract
Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However, the existence of complex backgrounds and multiple foreground objects makes this task challenging. To address this issue, we propose a guided slot attention network to reinforce spatial structural information and obtain better foreground-background separation. The foreground and background slots, which are initialized with query guidance, are iteratively refined based on interactions with template information. Furthermore, to improve slot-template interaction and effectively fuse global and local features in the target and reference frames, K-nearest neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally, we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments.
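
To make the pipeline described above concrete, the following is a minimal NumPy sketch of its three ideas: guided slot initialization, KNN-filtered iterative refinement, and cosine-similarity mask decoding. It is an illustration under stated assumptions, not the authors' implementation. The function names (`guided_slots`, `knn_filtered_refine`, `cosine_masks`), the coarse foreground map `fg_prob` used as guidance, the `temperature` parameter, and the plain attention update are all hypothetical. The sketch also omits the learned projections, the recurrent slot update commonly used in slot attention, and the feature aggregation transformer.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis):
    """Numerically stable softmax; -inf entries receive zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def guided_slots(feats, fg_prob):
    """Initialize foreground and background slots by weighted pooling.

    feats:   (N, D) flattened target-frame features.
    fg_prob: (N,) coarse foreground probability in [0, 1]
             (assumed guidance signal for this sketch).
    Returns a (2, D) array: row 0 is the foreground slot, row 1 the background slot.
    """
    weights = np.stack([fg_prob, 1.0 - fg_prob])            # (2, N)
    weights = weights / (weights.sum(axis=1, keepdims=True) + 1e-8)
    return weights @ feats


def knn_filtered_refine(slots, feats, k, iters=3):
    """Refine each slot by attending only to its k most similar features.

    slots: (S, D), feats: (N, D), with 1 <= k <= N.
    Ties at the k-th score may keep slightly more than k features.
    """
    dim = slots.shape[1]
    for _ in range(iters):
        logits = slots @ feats.T / np.sqrt(dim)              # (S, N)
        # k-th largest score in each row acts as the keep threshold.
        kth = np.partition(logits, -k, axis=1)[:, -k][:, None]
        filtered = np.where(logits >= kth, logits, -np.inf)  # drop non-neighbors
        attn = softmax(filtered, axis=1)                     # weights over features
        slots = attn @ feats                                 # simplified slot update
    return slots


def cosine_masks(slots, feats, temperature=0.1):
    """Soft masks from the cosine similarity between features and slots.

    Returns (N, S); each row sums to 1 across slots.
    """
    sim = l2_normalize(feats) @ l2_normalize(slots).T        # (N, S)
    return softmax(sim / temperature, axis=1)
```

In this sketch, filtering before the softmax means that features outside a slot's k nearest neighbors contribute nothing to its update, which is one simple way to keep background clutter from leaking into the foreground slot. Reshaping column 0 of the `cosine_masks` output to the feature-map height and width would give a foreground mask at feature resolution.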
