A Generalist Framework for Panoptic Segmentation of Images and Videos
Ting Chen, Lala Li, Saurabh Saxena, Geoffrey Hinton, David J. Fleet
TL;DR
Panoptic segmentation requires predicting a high-dimensional output in which any permutation of the instance IDs is equally valid. The authors propose a simple, generalist framework that treats panoptic masks as discrete data generated by Bit Diffusion, conditioned on the input image. The model combines an image encoder with a TransUNet-based mask decoder and extends to video by conditioning on past-frame predictions. Their Pix2Seq-$\mathcal{D}$ approach achieves results competitive with specialist methods on COCO and Cityscapes and performs well on unsupervised video segmentation, with ablations validating design choices such as analog-bit scaling, cross-entropy loss, and small-object loss weighting. The work demonstrates that a generic diffusion-based formulation can handle large token spaces and streaming video, offering a scalable alternative to task-specific panoptic architectures while leaving room for further improvements such as distillation and architectural refinements.
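To make the analog-bit idea concrete, the following is a minimal sketch of how a discrete label can be represented as real-valued bits and recovered by thresholding. It is an illustration under stated assumptions, not the authors' implementation: the function names, the most-significant-bit-first ordering, and the scale value of 0.1 are all hypothetical choices made for this example.

```python
def int_to_analog_bits(value, num_bits, scale=1.0):
    """Encode a non-negative integer as num_bits real numbers in {-scale, +scale}.

    Hypothetical helper for illustration; `scale` plays the role of the
    analog-bit scaling factor mentioned in the summary above.
    """
    if not 0 <= value < 2 ** num_bits:
        raise ValueError("value does not fit in num_bits")
    # Most significant bit first; bit 0 maps to -scale, bit 1 maps to +scale.
    bits = [(value >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]
    return [scale * (2 * b - 1) for b in bits]


def analog_bits_to_int(analog_bits):
    """Decode real-valued bits back to an integer by thresholding at zero."""
    value = 0
    for x in analog_bits:
        value = (value << 1) | (1 if x > 0 else 0)
    return value


# Label 5 is 0101 in four bits.
encoded = int_to_analog_bits(5, num_bits=4, scale=0.1)
decoded = analog_bits_to_int(encoded)
```

In a diffusion setting, a model would denoise tensors of such real-valued bits, and thresholding at zero would turn the final continuous output back into discrete labels.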
Abstract
Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on the inductive biases of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and a generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively with state-of-the-art specialist methods in similar settings.
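The one-to-many property mentioned in the abstract can be illustrated with a small sketch: two instance-ID masks that differ only by a relabeling of the IDs describe the same segmentation. The helper below is a hypothetical illustration of that property and is not part of the proposed method; it relabels IDs in order of first appearance so that equivalent masks compare equal.

```python
def canonicalize_instance_ids(mask):
    """Relabel instance IDs in order of first appearance (row-major scan).

    Hypothetical helper for illustration: masks that differ only by a
    permutation of instance IDs map to the same canonical form.
    """
    mapping = {}
    canonical = []
    for row in mask:
        new_row = []
        for inst_id in row:
            if inst_id not in mapping:
                mapping[inst_id] = len(mapping)
            new_row.append(mapping[inst_id])
        canonical.append(new_row)
    return canonical


# Two 2x3 instance-ID masks with IDs 1 and 2 swapped.
mask_a = [[1, 1, 2], [1, 2, 2]]
mask_b = [[2, 2, 1], [2, 1, 1]]

same_segmentation = (
    canonicalize_instance_ids(mask_a) == canonicalize_instance_ids(mask_b)
)
```

The two masks differ at every pixel yet encode the same grouping, so a per-pixel regression target is ambiguous, whereas a generative model can treat either labeling as a valid sample.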
