A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

Wouter Van Gansbeke, Bert De Brabandere

TL;DR

This work reframes panoptic segmentation as an image-conditioned latent diffusion problem, avoiding detection-heavy pipelines and post-processing. It introduces a two-stage framework: Stage 1 trains a shallow autoencoder to compress segmentation targets into latent codes $z_t$, and Stage 2 trains a diffusion model over these latents, conditioned on image latents $z_i$, to generate segmentation maps, capturing the conditional distribution $p(y|z_t,z_i)$. The model supports segmentation mask inpainting and extends to multi-task dense prediction via learnable task embeddings. Experiments on COCO and ADE20k show competitive performance against specialized and generalist baselines, demonstrating the approach's simplicity, flexibility, and potential for scaling to additional tasks.

Abstract

Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to manage the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture that omits these complexities. Our training consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. This generative approach unlocks the exploration of mask completion or inpainting. The experimental validation on COCO and ADE20k yields strong segmentation results. Finally, we demonstrate our model's adaptability to multi-tasking by introducing learnable task embeddings.
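Step (2) can be illustrated with a noise-prediction objective of the kind commonly used for diffusion models. The sketch below is a minimal NumPy illustration, not the authors' implementation: the linear noise schedule, the value `T = 1000`, the tensor shapes, and the `toy_denoiser` stand-in (a single linear map over concatenated mask and image latents) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: T steps with linearly increasing noise variance.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)


def add_noise(z_t, t, eps):
    """Forward process: mix clean mask latents with Gaussian noise at step t."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * z_t + np.sqrt(1.0 - a) * eps


def toy_denoiser(z_noisy, z_i, weights):
    """Hypothetical stand-in for the denoising network.

    Concatenates noisy mask latents and image latents along the channel
    axis and applies one linear map per pixel to predict the noise.
    """
    x = np.concatenate([z_noisy, z_i], axis=0)      # (2C, H, W)
    return np.einsum("oc,chw->ohw", weights, x)     # (C, H, W)


def training_loss(z_t, z_i, weights):
    """One training step: sample a timestep, add noise, score the prediction."""
    t = rng.integers(0, T)                          # 0-indexed timestep
    eps = rng.standard_normal(z_t.shape)            # applied Gaussian noise
    z_noisy = add_noise(z_t, t, eps)
    eps_hat = toy_denoiser(z_noisy, z_i, weights)   # predicted noise
    return float(np.mean((eps_hat - eps) ** 2))     # mean squared error


# Random arrays standing in for encoded masks (z_t) and encoded images (z_i).
C, H, W = 4, 8, 8
z_t = rng.standard_normal((C, H, W))
z_i = rng.standard_normal((C, H, W))
weights = 0.1 * rng.standard_normal((C, 2 * C))

loss = training_loss(z_t, z_i, weights)
print(f"noise-prediction loss: {loss:.4f}")
```

In a real system the linear map would be replaced by a learned network and the loss minimized by gradient descent; the sketch only shows how the image latents enter as conditioning and which quantity the loss compares.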

Paper Structure (26 sections, 4 equations, 16 figures, 6 tables, 3 algorithms)

Figures (16)

  • Figure 1: (Left) We present a simple generative approach for panoptic segmentation that builds upon Stable Diffusion (Rombach et al., 2022). The key idea is to leverage the diffusion process to bypass complex detection modules and to unlock mask inpainting. The generative process is conditioned on RGB images to iteratively predict the masks. (Right) Our framework can be extended to a multi-task setting by introducing task embeddings.
  • Figure 2: Overview of LDMSeg. Inspired by latent diffusion models, we present a simple diffusion framework for segmentation and mask inpainting. The approach consists of two stages: (i) learn continuous codes $z_t$ with a shallow autoencoder on the labels (Sec. \ref{subsec: stage1}); (ii) learn a denoising function conditioned on image latents $z_i$ (Sec. \ref{subsec: stage2}). In the second stage, the error between the predicted noise $\hat{\epsilon}$ and the applied Gaussian noise $\epsilon$ is minimized. During inference, we traverse the denoising process by starting from Gaussian noise. The models $f_t$ and $f_i$ respectively encode the labels and images. While we rely on the image encoder $f_i$ from Stable Diffusion (Rombach et al., 2022), we focus on $f_t$ and $g$ for segmentation. We aim to prioritize generality by limiting task-specific components.
  • Figure 3: Diffusion Process and SNR. (1) During training we randomly sample a timestep from $[1, T]$ in the denoising process. We can increase the RGB image's importance by strengthening the noise: (i) Following Rombach et al. (2022) and Chen et al. (2023), we downscale the latents $z_c$ using a scaling factor $s \in \mathbb{R}$ and demonstrate its impact: row 1 ($s = 1.0$) is clearly easier to decode than row 2 ($s \approx 0.18$, as in Rombach et al., 2022). (ii) Losses for timesteps near $0$ are further downscaled to avoid overfitting. Both strategies force the model to focus on the RGB image in generating plausible segmentation maps. Note that we do not apply explicit constraints to the prior distribution $p(z_t)$, e.g., matching a standard Gaussian $\mathcal{N}(0, 1)$. (2) During sampling the denoising process is traversed from right to left in $T_s$ iterations.
  • Figure 4: Qualitative Results. The figure displays results on COCO val2017. We follow the inference setup (Section \ref{sec: diffussion_process}) to sample from our model. Only the $\arg\max$ operator is applied for post-processing. Our model disentangles overlapping instances in challenging scenes without complex modules or post-processing. For visualization, segments are assigned random colors, and missing (VOID) pixels in the ground truth are black.
  • Figure 5: Mask Inpainting. The figure visualizes generated samples for different granularity levels by following Section \ref{subsec: inpainting}. The model can fill in missing regions by propagating the partially given (random) segmentation IDs using an image-conditioned diffusion process. Global mask inpainting (left) results are reasonable out-of-the-box, while sparse mask inpainting (right) shows inaccuracies. We hypothesize that this can be addressed by further finetuning LDMSeg on sparse inpainting masks.
  • ...and 11 more figures
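
The effect of the scaling factor $s$ described in the Figure 3 caption can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a linear noise schedule with `T = 1000` steps and unit-variance unscaled latents, neither of which is stated in the captions, and the `snr` helper is a hypothetical name. Under those assumptions, scaling the latents by $s$ multiplies the signal-to-noise ratio at every timestep by $s^2$, so $s \approx 0.18$ leaves the noisy latents with roughly 3% of the signal power available at $s = 1.0$.

```python
import numpy as np

# Assumed schedule: T steps with linearly increasing noise variance.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)


def snr(t, s, latent_var=1.0):
    """Signal-to-noise ratio of noisy latents at step t.

    The noisy latent is sqrt(a) * (s * z) + sqrt(1 - a) * eps, so the
    signal power is a * s^2 * Var(z) and the noise power is (1 - a).
    """
    a = alphas_cumprod[t]
    return a * (s ** 2) * latent_var / (1.0 - a)


for s in (1.0, 0.18):
    values = [snr(t, s) for t in (0, T // 2, T - 1)]
    print(f"s = {s}: SNR at early/middle/late steps = {values}")

# The ratio between the two settings is s^2 at every timestep.
ratio = snr(T // 2, 0.18) / snr(T // 2, 1.0)
print(f"SNR ratio (s = 0.18 vs s = 1.0): {ratio:.4f}")
```

A lower SNR means the noisy mask latents carry less information on their own, which is consistent with the caption's point that the model is pushed to rely on the RGB image.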