A Simple and Generalist Approach for Panoptic Segmentation

Nedyalko Prisadnikov, Wouter Van Gansbeke, Danda Pani Paudel, Luc Van Gool

TL;DR

This work demonstrates that a simple generalist panoptic segmentation framework can reach competitive performance by reusing a massively pretrained encoder (DINOv2) with a lightweight decoder and per-pixel prediction. The key innovations are centroid regression in the space of spectral positional embeddings and edge distance sampling (EDS), which together reduce training imbalance, both between small and large objects and between near-boundary and interior pixels. The approach achieves state-of-the-art PQ among generalist methods on COCO (PQ = 55.1 without depth, 54.5 with depth) and shows strong results on Cityscapes segmentation and NYUv2 depth prediction, highlighting its generalist potential. Overall, the method narrows the gap between generalist and specialized panoptic models while keeping the architecture simple and scalable.

Abstract

Panoptic segmentation is an important computer vision task, where the current state-of-the-art solutions require specialized components to perform well. We propose a simple generalist framework based on a deep encoder - shallow decoder architecture with per-pixel prediction: essentially, fine-tuning a massively pretrained image model with minimal additional components. Applied naively, this method does not yield good results. We show that this is due to an imbalance during training and propose a novel method for reducing it: centroid regression in the space of spectral positional embeddings. Our method achieves a panoptic quality (PQ) of 55.1 on the challenging MS-COCO dataset, state-of-the-art performance among generalist methods.
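
For readers unfamiliar with the metric, the sketch below computes PQ in its standard form: the sum of IoUs over matched segment pairs, divided by the number of true positives plus half the false positives and half the false negatives. The function name and its inputs are illustrative assumptions and are not taken from the paper's code. Matching is assumed to have been done beforehand, with a predicted and a ground-truth segment counted as a match when their IoU exceeds 0.5.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """Standard PQ for one class (hypothetical helper, not the paper's code).

    matched_ious:   IoU of every matched (prediction, ground truth) pair,
                    i.e. the true positives; each IoU is assumed to be > 0.5.
    num_false_pos:  predicted segments without a match.
    num_false_neg:  ground-truth segments without a match.
    """
    num_true_pos = len(matched_ious)
    denominator = num_true_pos + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denominator == 0:
        # No segments at all for this class; returning 0.0 is a convention
        # chosen for this sketch.
        return 0.0
    return sum(matched_ious) / denominator


if __name__ == "__main__":
    # Two matches, one spurious prediction, one missed ground-truth segment.
    print(panoptic_quality([0.8, 0.6], num_false_pos=1, num_false_neg=1))
```

Benchmark scores such as the COCO numbers quoted above are usually reported as the average of this per-class value over all classes.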

Paper Structure (43 sections, 14 equations, 8 figures, 4 tables)


Figures (8)

  • Figure 1: Overall model architecture in a multi-tasking setup. The vision encoder and the shallow decoder are shared among all tasks. The task specific projection heads are used only to map the decoded pixel embedding to the task required dimension. The depth prediction is an optional auxiliary task.
  • Figure 2: Schematic explanation of how instance labeling works. For each instance, we find the $u$ and $v$ coordinates of its center of mass. Based on these coordinates, we label each pixel belonging to the instance. The label is the concatenation of the positional embeddings of the $u$ and $v$ coordinates, respectively, hence the two-colored third dimension of the target and prediction. Lastly, we apply a distance loss on this representation, but only for pixels near object boundaries, as selected by EDS (see the EDS section).
  • Figure 3: With centroid encoding of the instances, the scale of the loss for mistakes at the border between two instances depends on how far apart the two centroids are. Note that the distance $d_{12}$ shown here is not very small, for visualization purposes; however, the model without PE-based encoding (b) still struggles at the boundary. Panel (c) shows the prediction from a model trained with PE-based encoding.
  • Figure 4: Effect of EDS on small instances. This chart shows that EDS has a disproportionately larger effect on smaller instances.
  • Figure 5: Effect of using PEs for the instance segmentation loss. The improvement in Intersection-over-Union is much stronger for smaller objects.
  • ...and 3 more figures
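
The labeling scheme described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sinusoidal form of the embedding, the frequency schedule, the normalization of coordinates to [0, 1], the convention that $u$ is horizontal and $v$ is vertical, and the names `sinusoidal_pe` and `centroid_pe_targets` are all hypothetical. Edge distance sampling is not shown.

```python
import numpy as np


def sinusoidal_pe(coord, num_freqs=4):
    """Embed a coordinate in [0, 1] with sines and cosines.

    The frequency schedule (powers of two times pi) is an assumption
    made for this sketch; the paper may use a different one.
    """
    coord = np.asarray(coord, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = coord[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


def centroid_pe_targets(instance_map, num_freqs=4):
    """Build per-pixel regression targets from an instance-ID map.

    instance_map: (H, W) integer array, where 0 means "no instance".
    Returns an (H, W, 4 * num_freqs) array. Every pixel of an instance
    holds the concatenated embeddings of that instance's centroid
    coordinates (u, v); pixels with ID 0 are left at zero.
    """
    height, width = instance_map.shape
    targets = np.zeros((height, width, 4 * num_freqs))
    for inst_id in np.unique(instance_map):
        if inst_id == 0:
            continue
        mask = instance_map == inst_id
        rows, cols = np.nonzero(mask)
        # Center of mass, normalized to [0, 1] (assumed convention).
        u_center = cols.mean() / max(width - 1, 1)
        v_center = rows.mean() / max(height - 1, 1)
        label = np.concatenate(
            [sinusoidal_pe(u_center, num_freqs),
             sinusoidal_pe(v_center, num_freqs)]
        )
        targets[mask] = label  # the same label for every pixel of the instance
    return targets


if __name__ == "__main__":
    inst = np.zeros((4, 6), dtype=int)
    inst[0:2, 0:2] = 1   # small instance in the top-left corner
    inst[2:4, 3:6] = 2   # larger instance in the bottom-right corner
    print(centroid_pe_targets(inst).shape)
```

Because all pixels of one instance share a single target vector, a mistake at the border between two instances is penalized according to the distance between their embedded centroids, which is the effect the Figure 3 caption refers to.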