Panoptic Diffusion Models: co-generation of images and segmentation maps

Yinghan Long, Kaushik Roy

TL;DR

Panoptic Diffusion Models tackle the lack of spatial structure control in diffusion-based generation by co-generating images and panoptic segmentation maps from prompts. The approach uses a dual diffusion framework with either a unified or a two-stream transformer architecture and introduces Multi-Scale Patching to produce high-resolution segmentation maps, guided by segmentation layouts during generation. Training combines image and map denoising objectives, with classifier-free map guidance to balance fidelity and diversity, and a fast DPM solver accelerates inference. On COCO2017, PDM achieves competitive image fidelity (FID) and improved image-text alignment (CLIP), while producing coherent, co-generated segmentation maps that enhance scene controllability and enable downstream vision tasks.
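
As a rough illustration of two ideas named above, the sketch below shows one way a combined image-and-map denoising objective and a classifier-free guidance step could be written. It is a minimal sketch, not the authors' implementation: the function names, the `map_weight` parameter, and the plain sum of two mean-squared-error terms are assumptions made for illustration. The guidance function uses the standard classifier-free guidance combination, which may differ from the exact map-guidance rule used in PDM.

```python
import numpy as np


def joint_denoising_loss(eps_img_pred, eps_img, eps_map_pred, eps_map, map_weight=1.0):
    """Hypothetical combined objective: image noise MSE plus weighted map noise MSE.

    map_weight is an assumed hyperparameter, not a value taken from the paper.
    """
    img_loss = np.mean((eps_img_pred - eps_img) ** 2)
    map_loss = np.mean((eps_map_pred - eps_map) ** 2)
    return img_loss + map_weight * map_loss


def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance combination of two noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional one,
    and larger values push the result further toward the condition.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy shapes standing in for an image tensor and a segmentation-map tensor.
    eps_img = rng.standard_normal((3, 8, 8))
    eps_map = rng.standard_normal((1, 8, 8))
    loss = joint_denoising_loss(eps_img * 0.9, eps_img, eps_map * 0.9, eps_map)
    guided = classifier_free_guidance(eps_img, np.zeros_like(eps_img), scale=2.0)
    print(float(loss), guided.shape)
```

A larger guidance scale generally trades sample diversity for closer adherence to the condition, which is the fidelity-diversity balance the summary refers to.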

Abstract

Recently, diffusion models have demonstrated impressive capabilities in text-guided and image-conditioned image generation. However, existing diffusion models cannot simultaneously generate an image and a panoptic segmentation of objects and stuff from the prompt. Incorporating an inherent understanding of shapes and scene layouts can improve the creativity and realism of diffusion models. To address this limitation, we present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process. This ensures the inclusion of categories mentioned in text prompts and enriches the diversity of segments within the background. We demonstrate the effectiveness of PDM across two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. We propose a Multi-Scale Patching mechanism to generate high-resolution segmentation maps. Additionally, when ground-truth maps are available, PDM can function as a text-guided image-to-image generation model. Finally, we propose a novel metric for evaluating the quality of generated maps and show that PDM achieves state-of-the-art results in image generation with implicit scene control.

Paper Structure

This paper contains 29 sections, 8 equations, 9 figures, 2 tables, 2 algorithms.

Figures (9)

  • Figure 1: Left: images generated by a regular diffusion model (U-ViT) based on the text prompt. Right: images and masks generated by a Panoptic Diffusion Model based on the same text.
  • Figure 2: Pipeline of Panoptic Diffusion Models
  • Figure 3: Two-stream panoptic diffusion model. A pretrained image stream is on the left and a fine-tuned segmentation map stream is on the right.
  • Figure 4: Image-map co-generation. Prompts are: 1) a small copper vase with some flowers in it; 2) A giraffe examining the back of another giraffe; 3) A utility truck is parked in the street beside traffic cones; 4) A white yellow and blue train at an empty train station.
  • Figure 5: Generated maps of different resolutions. Prompts are: 1) Three people are playing with a red kick ball; 2) A woman walking next to a man riding a pink bike; 3) An old man is flying his kite in the middle of no where; 4) A large lizard sitting on stone steps with three birds; 5) A girl is playing a game system while other kids look on; 6) A living room that has some couches and tables in it.
  • ...and 4 more figures