Guided Slot Attention for Unsupervised Video Object Segmentation

Minhyeok Lee, Suhwan Cho, Dogyoon Lee, Chaewon Park, Jungho Lee, Sangyoun Lee

TL;DR

This work tackles unsupervised video object segmentation in challenging scenes by introducing the Guided Slot Attention Network (GSA-Net). It generates guided slots from the target frame, fuses local and global contextual cues through a Feature Aggregation Transformer (FAT), and iteratively refines the slot representations using KNN-filtered attention. A cosine-similarity-based decoder then generates the mask. Results on DAVIS-16 and FBMS show state-of-the-art performance and robustness, and ablations confirm the contributions of guided slots, KNN filtering, and FAT. The approach improves foreground-background separation without requiring manual annotations at inference time, which is relevant to downstream video understanding tasks.

Abstract

Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However, the existence of complex backgrounds and multiple foreground objects makes this task challenging. To address this issue, we propose a guided slot attention network to reinforce spatial structural information and obtain better foreground-background separation. The foreground and background slots, which are initialized with query guidance, are iteratively refined based on interactions with template information. Furthermore, to improve slot-template interaction and effectively fuse global and local features in the target and reference frames, K-nearest neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally, we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments.
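
The iterative slot refinement with K-nearest neighbors filtering described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `knn_slot_attention`, the cosine-similarity ranking, the softmax over the selected features, and the residual update are all assumptions chosen for brevity, and the paper's actual update rule may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knn_slot_attention(slots, feats, k=4, iters=3):
    """Hypothetical sketch of KNN-filtered slot refinement.

    slots: (S, D) initial guided slots (e.g. foreground and background).
    feats: (N, D) aggregated features the slots attend to.
    """
    slots = np.asarray(slots, dtype=float).copy()
    feats = np.asarray(feats, dtype=float)
    for _ in range(iters):
        # Cosine similarity between every slot and every feature: (S, N).
        sim = l2_normalize(slots) @ l2_normalize(feats).T
        # KNN filtering: keep only the k most similar features per slot.
        idx = np.argsort(-sim, axis=1)[:, :k]            # (S, k)
        top_sim = np.take_along_axis(sim, idx, axis=1)   # (S, k)
        # Attention weights over the selected features only.
        weights = softmax(top_sim, axis=1)               # (S, k)
        selected = feats[idx]                            # (S, k, D)
        updates = (weights[..., None] * selected).sum(axis=1)
        # Assumed residual update; a learned update could be used instead.
        slots = slots + updates
    return slots

# Toy example: four "foreground-like" and four "background-like" features.
fg = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.95, 0.0]])
bg = fg[:, ::-1]
feats = np.concatenate([fg, bg], axis=0)
init_slots = np.array([[1.0, 0.1], [0.1, 1.0]])
refined = knn_slot_attention(init_slots, feats, k=4, iters=3)
```

In this toy setup each slot only ever aggregates the features nearest to it, which is the intuition behind filtering: features dominated by the other region do not contaminate the slot update.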

Paper Structure

This paper contains 18 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: (a) Input RGB image. (b) Activation map of baseline encoder features. (c) Slot activation map of the existing slot attention method. (d) Slot activation map of the proposed guided slot attention. When guided slot attention is applied to the encoder, it surpasses the encoder's own foreground extraction ability and shows stronger performance than the previous slot attention even in complex backgrounds.
  • Figure 2: Overall structure of the proposed model. The proposed model consists of independent RGB and optical flow encoder streams and a single decoder for mask generation. For simplicity, the optical flow stream is omitted from the figure.
  • Figure 3: The structure of the (a) slot generator, (b) local extractor, and (c) global extractor. The slot generator creates guided slots that store important features for mask generation. The local extractor utilizes the K-means clustering algorithm to generate clustering masks at the feature level and extract local features for each region. The global extractor generates soft object regions for the scene through channel-wise softmax operations and extracts global features using these regions.
  • Figure 4: The structure of FAT and GSA. FAT uses attentive pooling to generate intra-frame features from the global features of reference frames and a transformer block to generate global-to-local features. GSA uses guided slots to provide initial information for foreground and background discrimination, selects the features nearest to each slot from the aggregated features using the KNN algorithm, and applies an iterative attention mechanism to update the slots. FAT and GSA aim to generate useful features for target object mask reconstruction and improve foreground and background discrimination in slot attention.
  • Figure 5: Qualitative comparison between our GSA-Net and other state-of-the-art methods.
  • ...and 3 more figures
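
The local and global extractors described in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions: `proj` stands in for a hypothetical learned 1x1 projection that produces region logits, the K-means initialization is a simple deterministic choice, and neither function reproduces the paper's exact layers.

```python
import numpy as np

def soft_region_features(feat, proj):
    """Global extractor sketch: soft regions via channel-wise softmax.

    feat: (C, H, W) feature map.
    proj: (R, C) hypothetical 1x1 projection producing R region logits.
    Returns pooled features (R, C) and soft regions (R, H, W).
    """
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W).astype(float)       # (C, N)
    logits = proj @ flat                              # (R, N)
    logits = logits - logits.max(axis=0, keepdims=True)
    regions = np.exp(logits)
    regions = regions / regions.sum(axis=0, keepdims=True)  # softmax over R
    # Region-weighted average of the features for each soft region.
    pooled = (regions @ flat.T) / (regions.sum(axis=1, keepdims=True) + 1e-8)
    return pooled, regions.reshape(-1, H, W)

def kmeans_region_features(feat, k=2, iters=10):
    """Local extractor sketch: K-means masks at the feature level.

    Returns cluster means (k, C) and a label map (H, W).
    """
    C, H, W = feat.shape
    pts = feat.reshape(C, H * W).T.astype(float)      # (N, C)
    # Assumed deterministic init: evenly spaced pixels as starting centers.
    centers = pts[np.linspace(0, len(pts) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dist = ((pts[:, None, :] - centers[None]) ** 2).sum(axis=-1)  # (N, k)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pts[labels == j].mean(axis=0)
    return centers, labels.reshape(H, W)

# Toy feature map: left half activates channel 0, right half channel 1.
feat = np.zeros((2, 2, 4))
feat[0, :, :2] = 1.0
feat[1, :, 2:] = 1.0
pooled, regions = soft_region_features(feat, proj=5.0 * np.eye(2))
centers, label_map = kmeans_region_features(feat, k=2)
```

The hard K-means masks give one feature per spatial cluster, while the softmax regions give a differentiable, scene-level pooling; the caption describes them as complementary local and global cues.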