
A Generalist Framework for Panoptic Segmentation of Images and Videos

Ting Chen, Lala Li, Saurabh Saxena, Geoffrey Hinton, David J. Fleet

TL;DR

Panoptic segmentation requires predicting a high-dimensional output in which any permutation of the instance IDs is equally valid. The authors propose a simple, generalist framework that treats panoptic masks as discrete data generated by Bit Diffusion conditioned on the image, using an image encoder and a TransUNet-based mask decoder, and extends to video by conditioning on predictions from past frames. Their Pix2Seq-$\mathcal{D}$ approach achieves competitive results against specialist methods on COCO and Cityscapes and performs well on unsupervised video segmentation, with ablations validating design choices such as analog-bit scaling, cross-entropy loss, and small-object loss weighting. The work shows that a generic diffusion-based formulation can handle large token spaces and streaming video, offering a scalable alternative to task-specific panoptic architectures while leaving room for further improvements such as distillation and architectural refinements.

Abstract

Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on the inductive bias of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively with state-of-the-art specialist methods in similar settings.
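
To make the "discrete data generation" formulation concrete, the following is a minimal NumPy sketch of how an integer label mask could be turned into analog bits, perturbed with Gaussian noise, and decoded again by thresholding. It is an illustration under stated assumptions, not the authors' implementation: the function names, the least-significant-bit-first ordering, the 8-bit width, and the single noise level `gamma` are all hypothetical choices made for this example. The `scale` argument plays the role of the analog-bit input scaling mentioned in the TL;DR.

```python
import numpy as np

def int_to_analog_bits(mask, num_bits, scale=1.0):
    """Encode an integer mask (H, W) as real-valued bits in {-scale, +scale}.

    Output shape is (H, W, num_bits). Bit order (LSB first) is an
    assumption made for this sketch.
    """
    shifts = np.arange(num_bits)
    bits = (mask[..., None] >> shifts) & 1          # values in {0, 1}
    return (bits.astype(np.float32) * 2.0 - 1.0) * scale

def analog_bits_to_int(analog_bits):
    """Decode by thresholding each channel at zero and reassembling the integer."""
    bits = (analog_bits > 0).astype(np.int64)
    shifts = np.arange(bits.shape[-1])
    return (bits << shifts).sum(axis=-1)

def add_noise(x0, gamma, rng):
    """Forward diffusion sample: x_t = sqrt(gamma) * x0 + sqrt(1 - gamma) * eps.

    gamma in [0, 1] is the remaining signal level; with analog bits of
    magnitude `scale`, the signal-to-noise ratio is scale**2 * gamma / (1 - gamma).
    """
    eps = rng.standard_normal(x0.shape).astype(np.float32)
    return np.sqrt(gamma) * x0 + np.sqrt(1.0 - gamma) * eps

# Toy 2x2 "panoptic mask" whose entries are arbitrary integer IDs.
mask = np.array([[0, 3], [5, 200]], dtype=np.int64)
bits = int_to_analog_bits(mask, num_bits=8, scale=0.1)
recovered = analog_bits_to_int(bits)

rng = np.random.default_rng(0)
noisy = add_noise(bits, gamma=0.5, rng=rng)   # what a denoiser would receive
```

In this sketch, lowering `scale` leaves the noise magnitude unchanged while shrinking the signal, so the same `gamma` corresponds to a harder denoising problem.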

Paper Structure

This paper contains 33 sections, 4 equations, 15 figures, 8 tables, and 4 algorithms.

Figures (15)

  • Figure 1: We formulate panoptic segmentation as a conditional discrete mask ($\boldsymbol{m}$) generation problem for images (left) and videos (right), using a Bit Diffusion generative model (Chen et al., 2022).
  • Figure 2: The architecture for our panoptic mask generation framework. We separate the model into image encoder and mask decoder so that the iterative inference at test time only involves multiple passes over the decoder.
  • Figure 3: Noisy masks at different time steps under two input scaling factors, $b=1.0$ (top row) and $b=0.1$ (bottom row). Decreasing the input scaling factor leads to a smaller signal-to-noise ratio at the same time step, which gives higher weight to harder cases.
  • Figure 4: The effect of $p$ on loss weighting for panoptic masks. With $p=0$, every mask token is weighted equally (equivalent to no weighting). As $p$ increases, larger weight is given to mask tokens of smaller instances (indicated by warmer colors).
  • Figure 5: Mask decoder extended for video settings. The image conditional signal to the mask decoder is concatenated with mask predictions from previous frames of the video.
  • ...and 10 more figures
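
The loss weighting described in the Figure 4 caption can be sketched as follows. The caption only states the qualitative behavior (uniform weights at $p=0$, more weight on small instances as $p$ grows); the specific form used here, $w_{ij} = 1 / c_{ij}^{p}$ with $c_{ij}$ the pixel count of the instance covering pixel $(i, j)$, followed by normalization to mean 1, is an assumption for illustration, and the function name is hypothetical.

```python
import numpy as np

def small_object_weights(instance_ids, p):
    """Per-pixel loss weights that favor small instances.

    Assumed form: w = 1 / count**p, rescaled so the weights average to 1.
    With p = 0 every pixel receives weight 1 (no reweighting).
    """
    _, inverse, counts = np.unique(
        instance_ids, return_inverse=True, return_counts=True
    )
    # Pixel count of the instance that each pixel belongs to.
    pixel_counts = counts[inverse].reshape(instance_ids.shape).astype(np.float64)
    w = 1.0 / pixel_counts ** p
    return w * w.size / w.sum()

# Toy 3x3 instance map: instance 0 covers 6 pixels, instance 1 covers 2,
# and instance 2 covers a single pixel.
ids = np.array([[0, 0, 0],
                [0, 1, 1],
                [0, 0, 2]])

w_uniform = small_object_weights(ids, p=0.0)
w_small = small_object_weights(ids, p=1.0)
```

For the toy map with `p=1.0`, the single-pixel instance receives weight 3.0, the two-pixel instance 1.5, and the six-pixel instance 0.5, so the weights still average to 1 while the smallest instance contributes most per pixel.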