MUFASA: A Multi-Layer Framework for Slot Attention

Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

TL;DR

MUFASA addresses the limitation of slot-attention methods that rely solely on the final ViT layer by exploiting semantic information across multiple encoder layers. It introduces a lightweight, plug-and-play framework that runs independent slot-attention modules on several ViT layers, aligns their slots with Hungarian matching, and fuses them with a learned M-Fusion mechanism into a single representation for decoding. Across VOC, COCO, and MOVi-C, MUFASA consistently improves state-of-the-art unsupervised object segmentation results when integrated into SPOT or DINOSAUR, while also accelerating training convergence and incurring only modest inference overhead. The approach highlights the value of cross-layer semantic richness in ViT representations for object-centric learning and demonstrates practical applicability through substantial performance gains with limited computational burden.
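
The alignment step can be illustrated with a small sketch. Slot attention returns slots in arbitrary order, so slots computed on different layers have to be put into correspondence before they can be fused. The example below is only an illustration under stated assumptions: the helper name `align_slots` is hypothetical, and the matching cost (cosine distance between slot vectors) is an assumption, since this summary does not say which cost MUFASA uses. The assignment itself is solved with SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_slots(ref_slots, slots):
    """Reorder `slots` (K, D) so that row k best matches row k of `ref_slots` (K, D).

    Hypothetical helper; the cosine-distance cost is an assumption.
    """
    ref_n = ref_slots / np.linalg.norm(ref_slots, axis=1, keepdims=True)
    cand_n = slots / np.linalg.norm(slots, axis=1, keepdims=True)
    cost = 1.0 - ref_n @ cand_n.T  # (K, K), lower means more similar
    # One-to-one assignment minimizing the total cost (Hungarian matching).
    # For a square cost matrix the returned row indices are 0..K-1 in order.
    _, order = linear_sum_assignment(cost)
    return slots[order], order


# Toy data: the "other layer" holds the same slots in a shuffled order plus noise.
rng = np.random.default_rng(0)
K, D = 4, 8
ref = rng.normal(size=(K, D))
perm = np.array([2, 0, 3, 1])
shuffled = ref[perm] + 0.01 * rng.normal(size=(K, D))

aligned, order = align_slots(ref, shuffled)
print(order)  # indices into `shuffled` that restore the reference order
```

After alignment, slot k refers to the same candidate object in every layer, which is what makes an element-wise fusion across layers meaningful.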

Abstract

Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.

Paper Structure

This paper contains 21 sections, 6 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: MUFASA. Our novel framework for slot-based methods leverages multiple feature layers of vision transformers for object-centric learning. Integrated into the current best model, SPOT (Kakogeorgiou et al., 2024), we achieve a new state of the art in unsupervised object segmentation on PASCAL VOC, COCO, and MOVi-C, producing high-quality segmentation masks while requiring less time to train.
  • Figure 2: Complementarity of DINO layers. (a) PCA visualization for features from layers 4 and 10–12, each encoding varying semantics. (b) Corresponding attention masks from slot attention on these layers, showing different segmentations. (c) Segmentation mask of the single-layer SPOT. (d) The fused slot-attention mask of our SPOT-M captures the person and the dog in a single slot each and follows their boundaries more closely. (e) Gain by combining layers. Blue shows the segmentation accuracy of single-layer DINOSAUR models trained on different encoder layers; yellow is the original DINOSAUR using $\textrm{L}_{12}$. MUFASA on DINOSAUR combines multiple layers, surpassing all individual ones.
  • Figure 3: MUFASA architecture. (a) For an input image, features from multiple layers of a DINO encoder are processed by multiple slot attention (SA) modules, each producing slots $\mathcal{S}_m$ and corresponding attention masks $\mathcal{A}_m^{\mathrm{Slot}}$. After Hungarian matching, a fusion module merges slots and masks. A ViT decoder reconstructs the last encoder layer’s features from the fused slots, yielding the decoder attention mask $\mathcal{A}^{\mathrm{Dec}}$. The reconstruction loss $\mathcal{L}_{\mathrm{Rec}}$ guides training. (b) Hungarian matching (HM). The sets of slots and attention masks are re-ordered for best correspondence across layers. (c) Fusion module. The re-ordered sets of slots and masks are summed in adjacent pairs. Slots are projected into a fused representation $\mathcal{S}_{\mathrm{fused}}$, while a weighted combination of attention masks produces the fused mask $\mathcal{A}_{\mathrm{fused}}^{\mathrm{Slot}}$.
  • Figure 4: Comparison of segmentations. Exemplary segmentation masks on nine different images for SPOT-M (ours), SPOT, DINOSAUR-M (ours), and DINOSAUR. The first three images are from VOC, the next three from COCO, and the last three from MOVi-C. Integrating MUFASA results in segmentations that follow the object boundaries more closely compared to the baselines.
  • Figure 5: Segmentation per layer. Layer-wise SA masks and the fused mask on COCO. Each layer contributes complementary information (e.g., row 1: the plaque and bench edges in $\hat{\mathcal{A}}^{\mathrm{Slot}}_3$ vs. coarse segments in $\hat{\mathcal{A}}^{\mathrm{Slot}}_2$); the fused masks appear refined.
  • ...and 5 more figures
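
The fusion module described in the Figure 3 caption can be sketched roughly as follows. The caption only states that the re-ordered slots and masks are summed in adjacent pairs, that slots are projected into a fused representation, and that the fused mask is a weighted combination of attention masks. Everything else here is an assumption made for illustration: the function name `fuse`, concatenating the pair-wise sums before a single linear projection, softmax-normalized mask weights, and the final renormalization over slots. The random `proj` and zero `mask_weights` stand in for parameters that would be learned.

```python
import numpy as np


def fuse(slots, masks, proj, mask_weights):
    """Fuse aligned slots (M, K, D) and masks (M, K, N) from M layers.

    Hypothetical sketch; only the pair-wise summation, the projection,
    and the weighted mask combination come from the figure caption.
    """
    # Sum adjacent layer pairs (m, m+1), giving M-1 pair-wise entries.
    pair_slots = slots[:-1] + slots[1:]  # (M-1, K, D)
    pair_masks = masks[:-1] + masks[1:]  # (M-1, K, N)

    # Assumed: concatenate the pair-wise sums per slot, then project back to D.
    n_pairs, K, D = pair_slots.shape
    stacked = pair_slots.transpose(1, 0, 2).reshape(K, n_pairs * D)
    fused_slots = stacked @ proj  # (K, D)

    # Assumed: softmax-normalized weights over the pair-wise masks.
    w = np.exp(mask_weights) / np.exp(mask_weights).sum()
    fused_masks = np.tensordot(w, pair_masks, axes=1)  # (K, N)
    # Renormalize so each patch's mask values sum to 1 across slots.
    fused_masks = fused_masks / fused_masks.sum(axis=0, keepdims=True)
    return fused_slots, fused_masks


rng = np.random.default_rng(0)
M, K, D, N = 3, 4, 8, 16  # layers, slots, slot dim, patches
slots = rng.normal(size=(M, K, D))
logits = rng.normal(size=(M, K, N))
masks = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over slots
proj = rng.normal(size=((M - 1) * D, D)) / np.sqrt((M - 1) * D)
mask_weights = np.zeros(M - 1)

fused_slots, fused_masks = fuse(slots, masks, proj, mask_weights)
print(fused_slots.shape, fused_masks.shape)
```

The fused slots keep the shape of a single layer's slots, which is consistent with the plug-and-play claim: the decoder of the host method can consume them without changes.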