Scene-Centric Unsupervised Panoptic Segmentation
Oliver Hahn, Christoph Reich, Nikita Araslanov, Daniel Cremers, Christian Rupprecht, Stefan Roth
TL;DR
This work tackles unsupervised panoptic segmentation in scene-centric imagery by introducing CUPS, a three-stage framework that fuses self-supervised learning (SSL) visual representations with depth and motion cues to generate high-resolution panoptic pseudo labels. It combines depth-guided semantic labeling with 3D motion-based instance cues (via $SE(3)$ clustering), then completes training through bootstrapping and self-training with a momentum network, achieving state-of-the-art panoptic quality (PQ) on Cityscapes and strong cross-domain generalization. The approach also yields substantial gains in unsupervised semantic and class-agnostic instance segmentation and is data-efficient, supporting label-efficient learning when only limited annotated data is available. Overall, CUPS reduces the reliance on manual annotations for scene understanding and moves unsupervised panoptic segmentation toward practical deployment across diverse driving datasets.
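As an illustration of the momentum-network idea mentioned above, the following minimal sketch shows a generic exponential moving average (EMA) update of teacher weights from student weights. It is not the CUPS implementation: the function name `ema_update`, the flat list-of-floats weight representation, and the momentum value are all assumptions chosen for illustration.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               momentum: float = 0.999) -> List[float]:
    """Return teacher weights after one EMA step toward the student.

    `momentum` is a hypothetical default, not a value taken from the paper.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    # Each teacher weight moves a small step toward the matching student weight.
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy usage: the teacher drifts slowly toward the student.
teacher_weights = [1.0, 0.0]
student_weights = [0.0, 1.0]
teacher_weights = ema_update(teacher_weights, student_weights, momentum=0.9)
```

In self-training setups of this kind, the slowly changing teacher typically produces the pseudo labels while the student is updated by gradient descent; how CUPS wires these together is not specified in this summary.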
Abstract
Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data, combining visual representations, depth, and motion cues. Combining pseudo-label training with a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality (PQ), e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 percentage points in PQ.
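For readers unfamiliar with the metric, the sketch below computes panoptic quality for a single class using its standard definition: the sum of IoUs over matched segment pairs, divided by the number of true positives plus half the false positives and half the false negatives. The function name and inputs are hypothetical, and the sketch assumes matching has already been done (a predicted and a ground-truth segment match when their IoU exceeds 0.5). It is not the evaluation code used in the paper.

```python
from typing import Sequence


def panoptic_quality(matched_ious: Sequence[float],
                     num_false_positives: int,
                     num_false_negatives: int) -> float:
    """PQ for one class, given IoUs of already-matched segment pairs."""
    # Under the standard definition, a pair only counts as a match if IoU > 0.5.
    if any(iou <= 0.5 or iou > 1.0 for iou in matched_ious):
        raise ValueError("matched IoUs must lie in (0.5, 1.0]")
    num_true_positives = len(matched_ious)
    denominator = (num_true_positives
                   + 0.5 * num_false_positives
                   + 0.5 * num_false_negatives)
    if denominator == 0:
        # No predicted or ground-truth segments for this class.
        return 0.0
    return sum(matched_ious) / denominator


# Toy usage: two matched segments, one spurious prediction, one missed segment.
pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
```

A dataset-level PQ is usually obtained by averaging this per-class value over all classes, so a gain of 9.4 percentage points refers to an absolute difference on that averaged score.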
