A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation

Niclas Vödisch, Kürsat Petek, Markus Käppeler, Abhinav Valada, Wolfram Burgard

TL;DR

PASTEL addresses the problem of panoptic segmentation with minimal labeled data by leveraging a foundation-model backbone (DINOv2) to train two lightweight heads for semantic segmentation and boundary estimation, followed by a novel panoptic fusion that uses multi-scale predictions and a recursive two-way normalized cut to separate instances. It further enhances performance through self-training on feature-similar unlabeled images, enabling pseudo-label bootstrapping for the semantic head. The approach achieves state-of-the-art label-efficient results on Cityscapes, Pascal VOC, and PhenoBench, with substantial improvements using as few as 10–20 labeled images and additional gains from pseudo-label-based pretraining of other models. This has practical implications for deploying panoptic segmentation in robotics, where annotation budgets are limited and rapid adaptation to new environments is essential.

Abstract

A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by building on the groundwork laid by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by applying PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at http://pastel.cs.uni-freiburg.de.
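
To make the normalized-cut step more concrete, here is a minimal sketch of a recursive two-way normalized cut on a small affinity matrix. It is not the PASTEL implementation. The function names, the stopping threshold `max_ncut`, and the toy affinity matrix `W` are assumptions made for illustration. A plain power iteration stands in for a proper eigen-solver so that the example needs only the standard library.

```python
import math


def second_eigenvector(W, iters=300):
    """Approximate the second generalized eigenvector of (D - W) y = lambda * D y."""
    n = len(W)
    deg = [max(sum(row), 1e-12) for row in W]  # guard against isolated nodes
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    # Normalized affinity M = D^-1/2 W D^-1/2; its leading eigenvector is D^1/2 * 1.
    M = [[W[i][j] * inv_sqrt[i] * inv_sqrt[j] for j in range(n)] for i in range(n)]
    top = [math.sqrt(d) for d in deg]
    scale = math.sqrt(sum(t * t for t in top))
    top = [t / scale for t in top]
    v = [math.sin(i + 1.0) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        # Remove the trivial eigenvector, then apply (M + I) / 2, which is PSD.
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        v = [0.5 * (v[i] + sum(M[i][j] * v[j] for j in range(n))) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v)) or 1.0
        v = [a / norm for a in v]
    return [a * s for a, s in zip(v, inv_sqrt)]


def ncut_value(W, a, b):
    """Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)."""
    cut = sum(W[i][j] for i in a for j in b)
    assoc_a = sum(W[i][j] for i in a for j in range(len(W)))
    assoc_b = sum(W[i][j] for i in b for j in range(len(W)))
    if min(assoc_a, assoc_b) <= 0:
        return 0.0  # a disconnected part can be split off at no cost
    return cut / assoc_a + cut / assoc_b


def recursive_ncut(W, nodes=None, max_ncut=0.2):
    """Split the node set in two while the normalized cut stays below max_ncut."""
    if nodes is None:
        nodes = list(range(len(W)))
    if len(nodes) < 2:
        return [nodes]
    sub = [[W[i][j] for j in nodes] for i in nodes]
    y = second_eigenvector(sub)
    a = [k for k, val in enumerate(y) if val >= 0]  # threshold the eigenvector at zero
    b = [k for k, val in enumerate(y) if val < 0]
    if not a or not b or ncut_value(sub, a, b) > max_ncut:
        return [nodes]
    return (recursive_ncut(W, [nodes[k] for k in a], max_ncut)
            + recursive_ncut(W, [nodes[k] for k in b], max_ncut))


# Toy affinity: two tightly connected groups of three nodes, weakly linked to each other.
W = [[0.0 if i == j else (1.0 if (i < 3) == (j < 3) else 0.01) for j in range(6)]
     for i in range(6)]
print(recursive_ncut(W))  # [[0, 1, 2], [3, 4, 5]]
```

In this toy case, the first cut separates the two groups. The recursion then stops because any further split inside a group would exceed the assumed threshold.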

Paper Structure (15 sections, 10 equations, 10 figures, 12 tables)

Figures (10)

  • Figure 1: We propose PASTEL for label-efficient panoptic segmentation. Our method combines a DINOv2 (Oquab et al., 2023) backbone, which creates descriptive image features, with labels from only $k$ images, e.g., $k=10$ on Cityscapes (Cordts et al., 2016). A novel fusion module then merges semantic predictions with estimated object boundaries to yield the panoptic output.
  • Figure 2: Test-time overview of PASTEL illustrating the panoptic fusion scheme. For simplicity, we focus on the car and road classes after step (1). The overall module comprises the following steps: (1) Overlapping multi-scale predictions; (2) Conversion of the soft boundary map to an affinity matrix; (3) Boundary denoising; (4) Extraction of "stuff" to "thing" boundaries; (5) Class majority voting within enclosed areas; (6) Connected component analysis (CCA); (7) Filters on "thing" classes; (8) Filters on "stuff" classes; (9) Recursive two-way normalized cut (NCut) to separate connected instances; (10) Nearest neighbors-based hole filling of pixels with the ignore class.
  • Figure 3: We perform multi-scale test-time augmentation with overlapping image crops to mitigate visual artifacts at the borders. Before feeding the crops to the task-specific networks, we upsample them to the original image size. In this figure, we illustrate the approach for scale $s=2$ and an image crop overlap of $z=2$.
  • Figure 4: During self-training, we extract feature vectors $\{l_1, l_2, \dots, l_k\}$ of the labeled images as well as feature vectors $\{u_1, \dots, u_m\}$ of the unlabeled images. Since PASTEL performs better on unlabeled images that are more similar to the samples in the training set, we use the cosine similarity as the distance measure $d_{ij}$ for image sampling.
  • Figure 5: We provide qualitative results on both Cityscapes (left) and Pascal VOC (right) for examples taken from the respective val split. The depicted results are generated by PASTEL based on the semi-supervised setup, i.e., $\mathcal{L}_k + \mathcal{U}_{n \cdot k}$.
  • ...and 5 more figures
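
The similarity-based image sampling described for Figure 4 could be sketched as follows. The caption only states that cosine similarity between labeled and unlabeled feature vectors drives the selection. The concrete rule used here (each labeled image claims its `n` most similar, not yet selected unlabeled images) and the names `cosine_similarity` and `select_similar` are assumptions for illustration, not the authors' procedure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def select_similar(labeled, unlabeled, n):
    """For each labeled vector, pick up to n unused unlabeled indices, most similar first."""
    chosen = []
    taken = set()
    for feat in labeled:
        ranked = sorted(
            (j for j in range(len(unlabeled)) if j not in taken),
            key=lambda j: cosine_similarity(feat, unlabeled[j]),
            reverse=True,
        )
        for j in ranked[:n]:
            taken.add(j)
            chosen.append(j)
    return chosen


# Toy 2-D "features"; real image features would be high-dimensional.
labeled = [[1.0, 0.0], [0.0, 1.0]]
unlabeled = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0], [0.7, 0.7]]
print(select_similar(labeled, unlabeled, n=1))  # [0, 1]
```

With `k` labeled images and `n` picks each, this rule selects at most `n * k` unlabeled images for pseudo-labeling.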