Few-Shot Panoptic Segmentation With Foundation Models
Markus Käppeler, Kürsat Petek, Niclas Vödisch, Wolfram Burgard, Abhinav Valada
TL;DR
Panoptic segmentation traditionally requires dense pixel-level annotations, which limits practical deployment. The paper presents SPINO, which uses a frozen foundation-model backbone (DINOv2) to generate panoptic pseudo-labels from only $k \approx 10$ labeled images. Its pseudo-label generator has two lightweight heads, one for semantic segmentation and one for boundary estimation. The pseudo-labels are produced offline and then used to train a downstream panoptic segmentation model, which alone runs at inference time. Across Cityscapes, KITTI-360, and in-house data, SPINO achieves results competitive with fully supervised baselines while using less than 0.3% of the ground-truth labels, highlighting the practical potential of foundation-model-driven few-shot learning for complex visual recognition tasks.
Abstract
Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, despite being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground-truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments. To foster future research, we make the code and trained models publicly available at http://spino.cs.uni-freiburg.de.
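To make the role of the two heads concrete, the following minimal sketch shows one way a semantic map and a boundary map could be fused into instance IDs. It is an illustration under stated assumptions, not the SPINO implementation: the function name `fuse_panoptic`, the threshold `boundary_thresh`, and the fusion rule itself (4-connected components over non-boundary pixels of "thing" classes) are hypothetical choices made for this example.

```python
from collections import deque


def fuse_panoptic(semantic, boundary, thing_classes, boundary_thresh=0.5):
    """Fuse per-pixel class IDs and boundary probabilities into instance IDs.

    Assumed inputs (hypothetical, for illustration only):
      semantic: 2D list of int class IDs (output of a semantic head)
      boundary: 2D list of floats in [0, 1] (output of a boundary head)
      thing_classes: set of class IDs that form countable instances
    Returns a 2D list of instance IDs; 0 marks stuff or boundary pixels.
    """
    h, w = len(semantic), len(semantic[0])
    instance = [[0] * w for _ in range(h)]

    def eligible(y, x):
        # A pixel can join an instance if it is a "thing" and not a boundary.
        return semantic[y][x] in thing_classes and boundary[y][x] < boundary_thresh

    next_id = 1
    for y in range(h):
        for x in range(w):
            if instance[y][x] or not eligible(y, x):
                continue
            # Flood-fill one 4-connected component of the same class.
            instance[y][x] = next_id
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not instance[ny][nx]
                            and eligible(ny, nx)
                            and semantic[ny][nx] == semantic[y][x]):
                        instance[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return instance


# Toy example: class 1 is a "thing"; a boundary column splits it in two.
semantic = [[0, 1, 1, 1, 1],
            [0, 1, 1, 1, 1]]
boundary = [[0.0, 0.0, 0.9, 0.0, 0.0],
            [0.0, 0.0, 0.9, 0.0, 0.0]]
instances = fuse_panoptic(semantic, boundary, thing_classes={1})
```

In this toy case the high-boundary column separates the class-1 region into two instances. Boundary pixels are left unassigned here; a practical pipeline would need a further step to assign them to a neighboring instance.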
