SlotMatch: Distilling Object-Centric Representations for Unsupervised Video Segmentation

Diana-Nicoleta Grigore, Neelu Madan, Andreas Mogelmose, Thomas B. Moeslund, Radu Tudor Ionescu

TL;DR

The paper tackles unsupervised video segmentation by transferring object-centric slot representations from a large teacher to a compact student. It introduces SlotMatch, a slot-level knowledge distillation framework that aligns corresponding teacher and student slots via cosine similarity, using a joint objective $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{rec}} + \alpha \mathcal{L}_{\text{slot-contrast}} + \beta \mathcal{L}_{\text{slot-KD}}$ and explicitly avoiding reconstruction-level distillation. The authors provide theoretical justification that distilling slot representations suffices to propagate semantic structure, and empirically show that the SlotMatch student matches or outperforms the teacher across MOVi-E, YTVIS-2021, and DAVIS 2017 while being $3.6\times$ smaller and up to $2.7\times$ faster. Zero-shot results on OVIS indicate robustness to occlusion and domain shift. Overall, SlotMatch delivers a simple, efficient path to deploy state-of-the-art object-centric video models in resource-constrained environments.
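The joint objective above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that teacher and student slots are already paired by index and share the same dimension, and that the slot-KD term is the mean of one minus the cosine similarity; the exact form used in the paper may differ. The function names `slot_kd_loss` and `total_loss` and the default weights are hypothetical.

```python
import numpy as np


def slot_kd_loss(teacher_slots, student_slots, eps=1e-8):
    """Mean (1 - cosine similarity) over index-matched slots.

    Both inputs have shape (..., K, d): K slots of dimension d.
    Assumption: slot k of the student corresponds to slot k of the teacher.
    """
    t = np.asarray(teacher_slots, dtype=np.float64)
    s = np.asarray(student_slots, dtype=np.float64)
    if t.shape != s.shape:
        raise ValueError("teacher and student slots must have the same shape")
    dot = np.sum(t * s, axis=-1)
    norms = np.linalg.norm(t, axis=-1) * np.linalg.norm(s, axis=-1)
    cos = dot / np.maximum(norms, eps)  # eps guards against zero-norm slots
    return float(np.mean(1.0 - cos))


def total_loss(l_rec, l_slot_contrast, l_slot_kd, alpha=1.0, beta=1.0):
    """L_total = L_rec + alpha * L_slot-contrast + beta * L_slot-KD.

    The default weights are placeholders, not values from the paper.
    """
    return l_rec + alpha * l_slot_contrast + beta * l_slot_kd
```

Under these assumptions the slot-KD term is 0 when every student slot points in the same direction as its teacher slot and grows as the directions diverge. In an actual training setup the teacher slots would be treated as constants (the teacher is frozen), so gradients flow only through the student.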

Abstract

Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often have to rely on large and computationally expensive neural architectures. To this end, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is confirmed via theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on three datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x fewer parameters and running up to 2.7x faster. Moreover, our student surpasses all other state-of-the-art unsupervised video segmentation models.

Paper Structure

This paper contains 18 sections, 1 theorem, 19 equations, 4 figures, 10 tables.

Key Result

Theorem 1

Let $s^\mathbf{T}\!\in \mathbb{R}^d$ be a teacher slot and $s^\mathbf{S}\!\in \mathbb{R}^d$ a student slot, with $\left\|s^\mathbf{T}\right\| = \left\|s^\mathbf{S}\right\| = r$, where $r>0$. Let $f\!: \mathbb{R}^d \to \mathbb{R}^m$ be a $K_f$-Lipschitz neural network that decodes the slots into feat then:
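The conclusion of the theorem is cut off in this copy and is not reconstructed here. Independently of it, the stated hypotheses already connect cosine similarity to decoded features through a standard argument, sketched below as an illustration rather than as the paper's result. The shorthand $\cos(s^\mathbf{T}, s^\mathbf{S})$ for the cosine similarity of the two slots is introduced here for convenience. Since both slots have norm $r$:

$$\left\|s^\mathbf{T} - s^\mathbf{S}\right\|^2 = \left\|s^\mathbf{T}\right\|^2 + \left\|s^\mathbf{S}\right\|^2 - 2\left\langle s^\mathbf{T}, s^\mathbf{S}\right\rangle = 2r^2\left(1 - \cos(s^\mathbf{T}, s^\mathbf{S})\right).$$

Applying the $K_f$-Lipschitz property of $f$ then gives:

$$\left\|f(s^\mathbf{T}) - f(s^\mathbf{S})\right\| \le K_f \left\|s^\mathbf{T} - s^\mathbf{S}\right\| = K_f\, r \sqrt{2\left(1 - \cos(s^\mathbf{T}, s^\mathbf{S})\right)}.$$

So, under these hypotheses, driving the cosine similarity toward 1 drives the gap between the decoded outputs toward 0, which is consistent with the summary's claim that slot-level distillation makes reconstruction-level distillation unnecessary.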

Figures (4)

  • Figure 1: Comparison of SlotContrast (teacher) versus various student versions (including our SlotMatch), showing the trade-off between performance (mean Best Overlap or mBO) vs. inference speed (FPS). Circle area indicates parameter count (in millions). SlotMatch (cyan) outperforms its teacher, while being nearly twice as fast on NVIDIA A100. Best viewed in color.
  • Figure 2: Our SlotMatch framework performs knowledge distillation from a large frozen teacher model to a compact trainable student model. Both models process video frames through slot attention mechanisms, with the student learning through three loss components: reconstruction ($\mathcal{L}_{\text{rec}}$), temporal consistency ($\mathcal{L}_{\text{slot-contrast}}$), and our novel slot matching loss ($\mathcal{L}_{\text{slot-KD}}$) that directly aligns corresponding slots between teacher and student models using cosine similarity. Best viewed in color.
  • Figure 3: Qualitative segmentation results on MOVi-E (left) and YTVIS-2021 (right). The first row shows raw frames; the second and third rows show slots from the student and teacher models, respectively; the final row presents results from our distillation-based SlotMatch. SlotMatch recovers missed slots, refines object boundaries, and produces sharper, more consistent slots. Mistakes by the student and teacher models are annotated in red, while corrections and additional detections introduced by SlotMatch are highlighted in green. Best viewed in color.
  • Figure 4: Qualitative comparison on MOVi-E (left) and YTVIS-2021 (right). The second row shows outputs from the student model, while the third row presents results from our distillation-based SlotMatch. Student errors, including missed slots, are marked in red. Corrections and additional slots introduced by SlotMatch are highlighted in green. Best viewed in color.

Theorems & Definitions (3)

  • Theorem 1
  • proof
  • proof