Few-Shot Panoptic Segmentation With Foundation Models

Markus Käppeler, Kürsat Petek, Niclas Vödisch, Wolfram Burgard, Abhinav Valada

TL;DR

Panoptic segmentation traditionally requires dense pixel-level annotations, which limits deployment. The paper presents SPINO, which uses a frozen foundation-model backbone (DINOv2) to generate panoptic pseudo-labels from only $k \approx 10$ labeled images, using a dual-head pseudo-label generator for semantic segmentation and boundary estimation. These pseudo-labels are produced offline and then used to train a downstream panoptic model that supports online inference. Across Cityscapes, KITTI-360, and in-house data, SPINO achieves results competitive with fully supervised baselines while using less than 0.3% of the ground-truth labels, highlighting the practical potential of foundation-model-driven few-shot learning for complex visual recognition tasks.

Abstract

Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, despite being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments. To foster future research, we make the code and trained models publicly available at http://spino.cs.uni-freiburg.de.

Paper Structure

This paper contains 10 sections, 5 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: SPINO enables few-shot panoptic segmentation by exploiting descriptive image features from unsupervised task-agnostic pretraining. We generate panoptic pseudo-labels by learning from only $k \approx 10$ annotated images in an offline manner. We can then leverage these pseudo-labels to train any panoptic segmentation model enabling online deployment.
  • Figure 2: Overview of our proposed SPINO approach for few-shot panoptic segmentation. SPINO consists of two learning-based modules for semantic segmentation and boundary estimation that leverage features from the recent foundation model DINOv2 [oquab2023dinov2]. A panoptic fusion scheme combines their outputs using connected component analysis (CCA) and multiple small instance filtering steps. SPINO creates pseudo-labels for a large number of unlabeled images using only $k \approx 10$ images with ground truth annotations. These pseudo-labels can then be utilized to train any panoptic segmentation model.
  • Figure 3: Our proposed pseudo-label generator comprises two learnable modules for semantic segmentation and boundary estimation that exploit descriptive image features from the recent DINOv2 [oquab2023dinov2] foundation model, enabling training with only $k \approx 10$ ground truth panoptic annotations.
  • Figure 4: To enable online predictions and to further boost the performance compared to the pseudo-label generator, we train a bottom-up panoptic segmentation model using our generated pseudo-labels. The network consists of a frozen DINOv2 [oquab2023dinov2] backbone with an adapter [chen2023vision] and three task-specific heads, whose output is merged by a panoptic fusion module [cheng2020panoptic].
  • Figure 5: Qualitative performance of our pseudo-label generator in four diverse domains from both public and in-house data sources. From left to right, we show two samples each for Cityscapes [cordts2016cityscapes], KITTI-360 [liao2022kitti360], in-house automated driving, and an in-house office environment.
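
The Figure 2 caption describes a panoptic fusion scheme that combines the semantic and boundary predictions using connected component analysis (CCA) and small instance filtering. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation. The function name `fuse_panoptic`, the `VOID` label value, the `min_area` threshold, the 4-connectivity, and the choice to mark boundary and filtered pixels as void are all assumptions made for illustration. The caption mentions multiple filtering steps, whereas this sketch applies only one.

```python
from collections import deque

VOID = 255  # hypothetical "ignore" label, not taken from the paper


def fuse_panoptic(semantic, boundary, thing_classes, min_area=2):
    """Split "thing" regions into instances with 4-connected CCA.

    semantic: 2D list of class ids; boundary: 2D list of 0/1 flags.
    Returns (semantic_out, instance), where instance id 0 means "no instance".
    """
    height, width = len(semantic), len(semantic[0])
    semantic_out = [row[:] for row in semantic]
    instance = [[0] * width for _ in range(height)]
    visited = [[False] * width for _ in range(height)]
    next_id = 1

    for y in range(height):
        for x in range(width):
            cls = semantic[y][x]
            if cls not in thing_classes or visited[y][x]:
                continue  # "stuff" pixels keep their semantic label as-is
            if boundary[y][x]:
                semantic_out[y][x] = VOID  # assumption: boundaries become void
                continue

            # Flood fill one component of same-class, non-boundary pixels.
            visited[y][x] = True
            component = [(y, x)]
            queue = deque(component)
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue
                    if (visited[ny][nx] or boundary[ny][nx]
                            or semantic[ny][nx] != cls):
                        continue
                    visited[ny][nx] = True
                    component.append((ny, nx))
                    queue.append((ny, nx))

            # Small instance filtering: drop components below the threshold.
            if len(component) < min_area:
                for py, px in component:
                    semantic_out[py][px] = VOID
                continue

            for py, px in component:
                instance[py][px] = next_id
            next_id += 1

    return semantic_out, instance
```

As a usage example, take a toy 4x5 map in which class 1 is a "thing" class and class 0 is "stuff". A vertical boundary splits the top region into two instances, and the isolated single pixel in the last row is filtered out:

```python
semantic = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
boundary = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
semantic_out, instance = fuse_panoptic(semantic, boundary, thing_classes={1})
```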