MUFASA: A Multi-Layer Framework for Slot Attention
Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth
TL;DR
MUFASA addresses a limitation of slot-attention methods, which rely solely on the final ViT layer, by exploiting semantic information across multiple encoder layers. It is a lightweight, plug-and-play framework that runs independent slot-attention modules on several ViT layers, aligns their slots with Hungarian matching, and fuses them with a learned M-Fusion mechanism into a single representation for decoding. Integrated into SPOT or DINOSAUR, MUFASA consistently improves unsupervised object segmentation results on VOC, COCO, and MOVi-C over the prior state of the art, while also accelerating training convergence and incurring only modest inference overhead. These results indicate that intermediate ViT layers carry semantic information useful for object-centric learning that last-layer-only methods leave unused.
Abstract
Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.
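The align-then-fuse idea summarized above can be illustrated with a small sketch. It is not the MUFASA implementation, and several choices are assumptions made for illustration: the last layer's slots serve as the alignment reference, the matching cost is cosine distance, the optimal assignment is found by brute force over permutations (which yields the same optimum as Hungarian matching but is only practical for a handful of slots), and a plain element-wise mean stands in for the learned M-Fusion module. The function names `match_slots` and `align_and_average` are hypothetical.

```python
from itertools import permutations
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def match_slots(reference, other):
    """Return perm so that other[perm[i]] is assigned to reference[i].

    Brute-force search over all permutations; it minimizes the total
    cosine distance, as Hungarian matching would, but scales as K!.
    """
    k = len(reference)
    cost = [[cosine_distance(r, o) for o in other] for r in reference]
    best = min(
        permutations(range(k)),
        key=lambda p: sum(cost[i][p[i]] for i in range(k)),
    )
    return list(best)


def align_and_average(layers):
    """Align each layer's slots to the last layer, then average them.

    layers: list of slot sets, each a K x D list of lists.
    Assumption: the last entry is the reference, and the mean is only a
    placeholder for a learned fusion module.
    """
    reference = layers[-1]
    aligned = []
    for slots in layers:
        perm = match_slots(reference, slots)
        aligned.append([slots[j] for j in perm])
    k, d = len(reference), len(reference[0])
    fused = [
        [sum(layer[i][c] for layer in aligned) / len(aligned) for c in range(d)]
        for i in range(k)
    ]
    return aligned, fused


# Toy example: two layers, two slots each, with the slot order swapped.
last_layer = [[1.0, 0.0], [0.0, 1.0]]
earlier_layer = [[0.1, 0.9], [0.9, 0.1]]
aligned, fused = align_and_average([earlier_layer, last_layer])
```

In the toy example, the slots of the earlier layer are reordered to follow the last layer before averaging, so each fused slot combines vectors that refer to the same object. Without the alignment step, slot indices from independent slot-attention modules would not correspond, and any per-index fusion would mix unrelated objects.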
