Scene-Centric Unsupervised Panoptic Segmentation

Oliver Hahn, Christoph Reich, Nikita Araslanov, Daniel Cremers, Christian Rupprecht, Stefan Roth

TL;DR

This work tackles unsupervised panoptic segmentation in scene-centric imagery by introducing CUPS, a three-stage framework that fuses self-supervised (SSL) visual representations with depth and motion cues to generate high-resolution panoptic pseudo labels. It combines depth-guided semantic labeling with 3D motion-based instance cues (via $SE(3)$ clustering) and completes training through bootstrapping and self-training with a momentum network, achieving state-of-the-art PQ on Cityscapes and strong cross-domain generalization. The approach demonstrates substantial gains in unsupervised semantic and class-agnostic instance segmentation and shows strong data efficiency, enabling label-efficient learning with limited annotated data. Overall, CUPS reduces the reliance on manual annotations for scene understanding and extends unsupervised panoptic segmentation toward practical, real-world deployment across diverse driving datasets.
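The summary above mentions self-training with a momentum network but gives no further detail. A common way to maintain such a network is an exponential moving average (EMA) of the online network's weights; the sketch below illustrates that idea only. The EMA rule, the function name `ema_update`, the parameter dictionaries, and the value of `alpha` are all assumptions for illustration and are not taken from the paper.

```python
def ema_update(momentum_params, online_params, alpha=0.999):
    """Blend online weights into the momentum weights (assumed EMA rule).

    Both arguments are hypothetical dicts mapping parameter names to floats.
    alpha close to 1.0 makes the momentum network change slowly.
    """
    return {
        name: alpha * momentum_params[name] + (1.0 - alpha) * online_params[name]
        for name in momentum_params
    }


# Toy example with a single scalar "weight".
momentum = {"w": 1.0}
online = {"w": 0.0}
momentum = ema_update(momentum, online, alpha=0.9)
```

In a self-training loop of this kind, the slowly changing momentum network would typically produce the pseudo labels, while the online network is the one updated by gradient descent.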

Abstract

Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data, combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 percentage points in PQ.
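For readers unfamiliar with the PQ metric quoted above, the following minimal sketch computes the standard panoptic quality for a single class: the sum of IoUs over matched segment pairs, divided by TP + 0.5 FP + 0.5 FN, where a predicted and a ground-truth segment match if their IoU exceeds 0.5. Representing segments as sets of pixel indices and the function names are simplifications chosen for illustration; this is not the evaluation code of the paper.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(pred_segments, gt_segments):
    """PQ for one class; each segment is a set of pixel indices."""
    matched_ious = []
    used_pred = set()
    for gt in gt_segments:
        for i, pred in enumerate(pred_segments):
            if i in used_pred:
                continue
            value = iou(pred, gt)
            if value > 0.5:  # with non-overlapping segments, IoU > 0.5 makes the match unique
                matched_ious.append(value)
                used_pred.add(i)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp  # predictions without a match
    fn = len(gt_segments) - tp    # ground-truth segments without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0


gt = [{0, 1, 2, 3}, {10, 11}]
pred = [{0, 1, 2}, {20, 21}]
pq = panoptic_quality(pred, gt)  # one match with IoU 0.75, one FP, one FN
```

In this toy case the single match contributes an IoU of 0.75 and the denominator is 1 + 0.5 + 0.5 = 2, so PQ is 0.375. The overall PQ is usually reported as the average over classes.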

Paper Structure

This paper contains 20 sections, 5 equations, 10 figures, 23 tables.

Figures (10)

  • Figure 1: Results and overview of our unsupervised panoptic segmentation approach CUPS. We visualize panoptic predictions (top) of the current state of the art, U2Seg Niu:2024:UUI, and the proposed CUPS on various scene-centric datasets. We utilize motion and depth cues from stereo frames (bottom left) to generate scene-centric pseudo labels. Given a monocular image (bottom right) we learn a panoptic network using our pseudo labels and self-training. CUPS significantly outperforms U2Seg, indicated by the gains in panoptic quality (PQ).
  • Figure 2: Comparing MaskCut Wang:2023:CAL to our instance labeling on Cityscapes val. For scene-centric images, MaskCut attends to areas with high semantic correlation instead of instances, reflected in a mask precision (at an IoU threshold of 50%) of 6.5 for MaskCut and 59.6 for our instance labels.
  • Figure 3: Stage 1: CUPS pseudo-label generation. Instance pseudo labeling applies ensembling-based SF2SE3 motion segmentation Sommer:2022:SCS to scene flow extracted from flow and depth estimates. Semantic pseudo labeling uses a semantic network, distilling and clustering DINO features Caron:2021:EPS, combined with a depth-guided inference. Instance and semantic fusion aligns the two signals into panoptic pseudo labels.
  • Figure 4: Stage 2: CUPS panoptic bootstrapping.
  • Figure 5: Stage 3: CUPS panoptic self-training.
  • ...and 5 more figures
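
The Figure 3 caption states that scene flow is extracted from flow and depth estimates. One generic way to do this, sketched below, is to back-project a pixel into 3D in both frames with a pinhole camera model and take the difference of the two 3D points. The pinhole model, the intrinsics `fx`, `fy`, `cx`, `cy`, and the function names are assumptions made for illustration; the paper's actual procedure may differ.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point (pinhole camera, assumed)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def scene_flow(u, v, flow_uv, depth_t0, depth_t1, fx, fy, cx, cy):
    """3D motion of the point seen at pixel (u, v) in the first frame.

    flow_uv  : optical flow (du, dv) at (u, v)
    depth_t0 : depth at (u, v) in the first frame
    depth_t1 : depth at the flow-displaced pixel in the second frame
    """
    p0 = backproject(u, v, depth_t0, fx, fy, cx, cy)
    u1, v1 = u + flow_uv[0], v + flow_uv[1]
    p1 = backproject(u1, v1, depth_t1, fx, fy, cx, cy)
    return tuple(b - a for a, b in zip(p0, p1))


# Toy example: the point moves one unit to the right at constant depth.
motion = scene_flow(100.0, 0.0, (50.0, 0.0), 2.0, 2.0,
                    fx=100.0, fy=100.0, cx=0.0, cy=0.0)
```

Here `motion` is `(1.0, 0.0, 0.0)`: the pixel shifts from u = 100 to u = 150 at depth 2, which corresponds to X moving from 2 to 3. A motion segmentation method can then group pixels whose 3D motion is explained by a common rigid transform.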