SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow

Chaoyang Wang, Xiangtai Li, Lu Qi, Henghui Ding, Yunhai Tong, Ming-Hsuan Yang

TL;DR

This work proposes a unified framework (SemFlow) that models semantic segmentation and semantic image synthesis as a pair of reverse problems. Motivated by rectified flow theory, it trains an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks.

Abstract

Semantic segmentation and semantic image synthesis are two representative tasks in visual perception and generation. While existing methods consider them as two distinct tasks, we propose a unified framework (SemFlow) and model them as a pair of reverse problems. Specifically, motivated by rectified flow theory, we train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks. As the training objective is symmetric, samples from the two distributions, images and semantic masks, can be transferred in either direction with the same model. For semantic segmentation, our approach resolves the contradiction between the randomness of diffusion outputs and the uniqueness of segmentation results. For image synthesis, we propose a finite perturbation approach to enhance the diversity of generated results without changing the semantic categories. Experiments show that SemFlow achieves competitive results on semantic segmentation and semantic image synthesis tasks. We hope this simple framework will motivate people to rethink the unification of low-level and high-level vision.
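
For readers unfamiliar with rectified flow, the generic formulation behind the abstract can be sketched as follows. This is the standard rectified flow setup, not a transcription of the paper's own equations: the symbols x_0, x_1, v_\theta, \pi_0 and \pi_1 are assumptions, and which endpoint holds images versus masks is chosen here only for illustration.

```latex
% Assumed notation: x_0 ~ \pi_0 (e.g. images), x_1 ~ \pi_1 (e.g. masks), t in [0, 1].
% Straight-line interpolation between a paired image and mask:
x_t = (1 - t)\, x_0 + t\, x_1
% Regress the velocity network onto the constant direction of that line:
\mathcal{L}(\theta) = \mathbb{E}_{(x_0, x_1),\; t \sim \mathcal{U}[0, 1]}
    \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
% Transport is the solution of the learned ODE:
\frac{\mathrm{d} x_t}{\mathrm{d} t} = v_\theta(x_t, t)
```

Under this formulation, swapping the roles of x_0 and x_1 only flips the sign of the regression target, which is one way to read the statement that the training objective is symmetric: integrating the same ODE from t = 0 to t = 1 maps one domain to the other, and integrating from t = 1 to t = 0 maps back.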

Paper Structure (14 sections, 14 equations, 12 figures, 2 tables)

This paper contains 14 sections, 14 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Rectified flow bridges semantic segmentation (SS) and semantic image synthesis (SIS). SS and SIS are modeled as a pair of transportation problems between the distributions of images and masks. They share the same ODE and only differ in the direction of the velocity field. We propose a finite perturbation operation on the mask to enable multi-modal generation without changing the semantic labels. Grey dots represent data samples. Colored dots represent semantic centroids, also known as anchors in Eq. \ref{eq:anchor}. Colored bubbles represent the scale of perturbation.
  • Figure 2: Semantic segmentation results on the COCO-Stuff dataset. For the ground truth, each color reflects the value of an anchor (Eq. \ref{eq:anchor}), which corresponds to one semantic category, and white indicates the ignored regions. The predictions of DSM vary considerably under different random seeds.
  • Figure 3: Semantic segmentation and semantic image synthesis results on the Cityscapes dataset. Black in the ground truth indicates the ignored region. The segmentation results of SemFlow are colored following the Cityscapes convention (Cordts et al., 2016).
  • Figure 4: Semantic image synthesis results on the CelebAMask-HQ dataset. Semantic masks are colored to show different semantic components. "SemFlow w/ Perturbation" indicates the finite perturbation operation in Eq. \ref{eq:add_noise}.
  • Figure 5: Image synthesis results with different inference steps. We use the forward Euler method to get numerical solutions. Our approach obtains competitive results even with only one inference step.
  • ...and 7 more figures
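
The captions of Figures 1 and 5 mention that both tasks share one ODE, differ only in direction, and are solved with the forward Euler method. The sketch below illustrates that idea on a toy problem. It is not the authors' code: `euler_integrate`, `toy_velocity`, and the three-element `image` and `mask` vectors are hypothetical stand-ins, and the constant velocity field replaces a trained network under the assumption of a perfectly straight trajectory.

```python
import numpy as np


def euler_integrate(velocity, x_start, t_start, t_end, num_steps):
    """Forward Euler solve of dx/dt = velocity(x, t) from t_start to t_end."""
    x = np.asarray(x_start, dtype=float).copy()
    dt = (t_end - t_start) / num_steps  # negative when integrating backwards
    t = t_start
    for _ in range(num_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


# Hypothetical paired samples standing in for an image and its mask.
image = np.array([0.2, -0.5, 0.9])
mask = np.array([1.0, 1.0, -1.0])


def toy_velocity(x, t):
    # Assumption: an idealized, perfectly straight field for this single pair.
    # A trained model would predict this direction from (x, t) instead.
    return mask - image


# "Segmentation" direction: integrate from t = 0 (image) to t = 1 (mask).
pred_mask_1 = euler_integrate(toy_velocity, image, 0.0, 1.0, num_steps=1)
pred_mask_25 = euler_integrate(toy_velocity, image, 0.0, 1.0, num_steps=25)

# "Synthesis" direction: same field, integrated from t = 1 back to t = 0.
recon_image = euler_integrate(toy_velocity, mask, 1.0, 0.0, num_steps=5)
```

When the trajectory is exactly straight, as in this toy field, a single Euler step already lands on the target, which gives some intuition for why few-step inference can remain competitive. A learned velocity field is only approximately straight, so real results will depend on the number of steps.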