SlotMatch: Distilling Object-Centric Representations for Unsupervised Video Segmentation
Diana-Nicoleta Grigore, Neelu Madan, Andreas Mogelmose, Thomas B. Moeslund, Radu Tudor Ionescu
TL;DR
The paper tackles unsupervised video segmentation by transferring object-centric slot representations from a large teacher to a compact student. It introduces SlotMatch, a slot-level knowledge distillation framework that aligns corresponding teacher and student slots via cosine similarity, using a joint objective $L_{\text{total}} = L_{\text{rec}} + \alpha L_{\text{slot-contrast}} + \beta L_{\text{slot-KD}}$ and explicitly avoiding reconstruction-level distillation. The authors provide theoretical justification that distilling slot representations suffices to propagate semantic structure, and empirically show that the SlotMatch student matches or outperforms the teacher across MOVi-E, YTVIS-2021, and DAVIS 2017 while being $3.6\times$ smaller and up to $2.7\times$ faster. Zero-shot results on OVIS indicate robustness to occlusion and domain shift. Overall, SlotMatch delivers a simple, efficient path to deploy state-of-the-art object-centric video models in resource-constrained environments.
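To make the joint objective concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that teacher and student slots share the same dimensionality and are already matched by index, and that the slot-level distillation term takes the form of one minus the cosine similarity averaged over all slots; the summary does not spell out the exact form. The function names `slot_kd_loss` and `total_loss`, the tensor shapes, and the default values of `alpha` and `beta` are hypothetical.

```python
import numpy as np


def slot_kd_loss(teacher_slots, student_slots, eps=1e-8):
    """Mean (1 - cosine similarity) over index-matched teacher/student slots.

    Assumed shapes: (batch, num_slots, slot_dim) for both inputs.
    The 1 - cos form is an illustrative assumption, not taken from the paper.
    """
    t = np.asarray(teacher_slots, dtype=np.float64)
    s = np.asarray(student_slots, dtype=np.float64)
    if t.shape != s.shape:
        raise ValueError("teacher and student slots must have the same shape")
    # Cosine similarity along the slot feature dimension.
    dot = np.sum(t * s, axis=-1)
    norms = np.linalg.norm(t, axis=-1) * np.linalg.norm(s, axis=-1)
    cos = dot / np.maximum(norms, eps)  # eps guards against zero-norm slots
    return float(np.mean(1.0 - cos))


def total_loss(l_rec, l_slot_contrast, l_slot_kd, alpha=1.0, beta=1.0):
    """L_total = L_rec + alpha * L_slot-contrast + beta * L_slot-KD.

    alpha and beta defaults are placeholders, not values from the paper.
    """
    return l_rec + alpha * l_slot_contrast + beta * l_slot_kd


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(2, 4, 8))  # 2 frames, 4 slots, 8-dim slots
    student = teacher + 0.1 * rng.normal(size=teacher.shape)
    kd = slot_kd_loss(teacher, student)
    print("slot-KD loss:", kd)
    print("total loss:", total_loss(l_rec=0.5, l_slot_contrast=0.2, l_slot_kd=kd))
```

Because the cosine similarity ignores vector magnitude, this term only constrains the direction of each student slot, which is one plausible reading of why no reconstruction-level distillation is needed on top of it.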
Abstract
Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often have to rely on large and computationally expensive neural architectures. To reduce this cost, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is supported by theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on three datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x fewer parameters and running up to 2.7x faster. Moreover, our student surpasses all other state-of-the-art unsupervised video segmentation models.
