A Simple and Generalist Approach for Panoptic Segmentation
Nedyalko Prisadnikov, Wouter Van Gansbeke, Danda Pani Paudel, Luc Van Gool
TL;DR
This work shows that a simple generalist panoptic segmentation framework can reach competitive performance by reusing a massively pretrained encoder (DINOv2) with a lightweight decoder and per-pixel prediction. The key innovations are centroid regression in the space of spectral positional embeddings and edge distance sampling, which together mitigate training imbalance, both between small and large objects and in near-boundary regions. The approach achieves state-of-the-art PQ among generalist methods on COCO (PQ = 55.1 without depth, 54.5 with depth) and shows strong results on Cityscapes and on depth prediction on NYUv2, highlighting its generalist potential. Overall, the method narrows the gap between generalist and specialized panoptic models while keeping a simple, scalable architecture.
Abstract
Panoptic segmentation is an important computer vision task, where the current state-of-the-art solutions require specialized components to perform well. We propose a simple generalist framework based on a deep-encoder, shallow-decoder architecture with per-pixel prediction, which essentially amounts to fine-tuning a massively pretrained image model with minimal additional components. Applied naively, this method does not yield good results. We show that this is due to imbalance during training and propose a novel method for reducing it: centroid regression in the space of spectral positional embeddings. Our method achieves a panoptic quality (PQ) of 55.1 on the challenging MS-COCO dataset, state-of-the-art performance among generalist methods.
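
For readers unfamiliar with the PQ figure quoted above, the following is a minimal sketch of the standard panoptic quality metric for a single class: predicted and ground-truth segments are matched when their IoU exceeds 0.5, and PQ is the sum of matched IoUs divided by TP + 0.5 FP + 0.5 FN. It is not the authors' evaluation code. The function name `panoptic_quality`, the id-map input format, and the use of 0 as a void label are assumptions made for illustration. Benchmark evaluation additionally averages PQ over classes and treats void and crowd regions specially, which this sketch omits.

```python
import numpy as np


def panoptic_quality(pred, gt, iou_threshold=0.5):
    """Single-class PQ sketch (hypothetical helper, not the paper's code).

    pred, gt: 2D integer arrays of segment ids; id 0 is treated as void
    and ignored (an assumption of this sketch).
    """
    pred_ids = [i for i in np.unique(pred) if i != 0]
    gt_ids = [i for i in np.unique(gt) if i != 0]

    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for g in gt_ids:
        g_mask = gt == g
        for p in pred_ids:
            if p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            if inter == 0:
                continue
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union
            # With a threshold of 0.5 each segment has at most one match,
            # so taking the first match above the threshold is sufficient.
            if iou > iou_threshold:
                tp += 1
                iou_sum += iou
                matched_pred.add(p)
                break

    fp = len(pred_ids) - tp  # predicted segments with no match
    fn = len(gt_ids) - tp    # ground-truth segments with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom > 0 else 0.0


# Toy 4x4 example: two ground-truth segments, three predicted segments.
gt = np.array([[1, 1, 2, 2]] * 4)
pred = np.array([[1, 1, 2, 2]] * 3 + [[1, 1, 3, 3]])
pq = panoptic_quality(pred, gt)
print(pq)
```

In the toy example the left segment is matched perfectly (IoU 1.0), the right segment is matched with IoU 6/8 = 0.75, and the small predicted segment 3 is a false positive, so PQ = 1.75 / (2 + 0.5) = 0.7.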
