IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis

Feng Liu, Xiaobin Chang

TL;DR

The paper addresses semantic image synthesis conditioned on a segmentation mask and a style reference image, and highlights the difficulty GAN-based approaches have in satisfying both conditions jointly. It introduces IIDM, a latent-diffusion image-to-image model that starts from a noise-contaminated style latent $\mathbf{z}_T$, derived from $\mathbf{z}_0=\operatorname{E}(\mathbf{X}_R)$, and progressively denoises it under segmentation guidance; decoding the denoised latent $\hat{\mathbf{z}}_0$ yields the generated image $\mathbf{X}_G$. Three plug-in inference modules (Refinement, Color Transfer, and Model Ensemble) are proposed to boost image quality and style fidelity without retraining. Evaluations on a large landscape dataset show that IIDM achieves state-of-the-art results in mask accuracy, FID, and style similarity, and wins competitive benchmarks, underscoring the practicality of diffusion-based semantic synthesis with lightweight inference-time steps.

Abstract

Semantic image synthesis aims to generate high-quality images given semantic conditions, i.e. segmentation masks and style reference images. Existing methods widely adopt generative adversarial networks (GANs). GANs take all conditional inputs and directly synthesize images in a single forward step. In this paper, semantic image synthesis is treated as an image denoising task and is handled with a novel image-to-image diffusion model (IIDM). Specifically, the style reference is first contaminated with random noise and then progressively denoised by IIDM, guided by segmentation masks. Moreover, three techniques, refinement, color-transfer and model ensembles, are proposed to further boost the generation quality. They are plug-in inference modules and do not require additional training. Extensive experiments show that our IIDM outperforms existing state-of-the-art methods by clear margins. Further analysis is provided via detailed demonstrations. We have implemented IIDM based on the Jittor framework; code is available at https://github.com/ader47/jittor-jieke-semantic_images_synthesis.
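
The noise-then-denoise procedure described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the linear beta schedule, the closed-form DDPM noising step, the deterministic DDIM-style reverse update, and the names `make_alpha_bar`, `noise_style_latent`, `denoise`, and `eps_model` are all hypothetical. The encoder $\operatorname{E}$ and decoder $\operatorname{D}$ are omitted, so the code operates directly on a latent array.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_style_latent(z0, alpha_bar_T, rng):
    """Contaminate the style latent: z_T = sqrt(a) * z_0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(z0.shape)
    z_T = np.sqrt(alpha_bar_T) * z0 + np.sqrt(1.0 - alpha_bar_T) * eps
    return z_T, eps

def denoise(z_T, mask, eps_model, alpha_bar):
    """Deterministic (DDIM-style, assumed) reverse loop conditioned on the mask.

    eps_model(z, t, mask) is a placeholder for a trained noise predictor.
    """
    z = z_T
    for t in range(len(alpha_bar) - 1, -1, -1):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps_hat = eps_model(z, t, mask)
        # Estimate the clean latent, then step to the previous noise level.
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return z
```

In this sketch the style information enters only through the starting point `z_T`, while the segmentation mask enters only through the noise predictor, which mirrors the division of roles the abstract describes. With a real model, `eps_model` would be the trained network and the result would be passed through the decoder to obtain the generated image.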

Paper Structure (13 sections, 8 equations, 7 figures, 2 tables, 2 algorithms)

Figures (7)

  • Figure 1: Semantic image synthesis approaches: (a) GAN-based models generate an image that simultaneously satisfies both style and segmentation conditions in a single step; (b) our proposed IIDM is a progressive generation (denoising) process. Different conditions are taken into consideration at different stages of the process.
  • Figure 2: IIDM. The diffusion and denoising processes take place in the latent space via the encoder $\operatorname{E}$. The diffusion process first incorporates the style reference $\textbf{X}_R$ into $\textbf{z}_T$. The denoising process then recovers a denoised latent representation $\hat{\textbf{z}}_0$ conditioning on the segmentation map $\textbf{m}$. The generated image $\textbf{X}_G$ is decoded by $\operatorname{D}$ from $\hat{\textbf{z}}_0$.
  • Figure 3: Refinement improves the quality of generated images through iterative generation. The inference procedure can be repeated based on the synthesized image $\textbf{X}_G^k$.
  • Figure 4: Color-transfer compensates for style information during the refinement process. At the beginning of a refinement round, style reference $\textbf{X}_R$ is incorporated by color-transfer.
  • Figure 5: Images generated by GauGAN, CLADE, and IIDM, given the same conditional inputs. The outputs of IIDM have higher quality and are more similar in style to the reference image than those of its counterparts.
  • ...and 2 more figures
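
The refinement and color-transfer steps described in the captions of Figures 3 and 4 can be sketched as a simple loop. This is a hedged illustration, not the paper's code: the captions do not say which color-transfer algorithm is used, so the per-channel mean and standard deviation matching below is a swapped-in, simplified stand-in, and `generate`, `color_transfer`, and `refine` are hypothetical names. `generate(image, mask)` stands for one full IIDM inference pass (encode, add noise, denoise under mask guidance, decode).

```python
import numpy as np

def color_transfer(image, reference, eps=1e-8):
    """Match per-channel mean and std of `image` to `reference`.

    Both arrays are float images of shape (H, W, C). This simple statistic
    matching is an assumed stand-in for the paper's color-transfer module.
    """
    mu_i = image.mean(axis=(0, 1), keepdims=True)
    sd_i = image.std(axis=(0, 1), keepdims=True)
    mu_r = reference.mean(axis=(0, 1), keepdims=True)
    sd_r = reference.std(axis=(0, 1), keepdims=True)
    return (image - mu_i) / (sd_i + eps) * sd_r + mu_r

def refine(generate, x_ref, mask, rounds=2):
    """Run one initial generation, then `rounds` refinement passes.

    Each refinement round starts from the previous output after its colors
    have been pulled back toward the style reference.
    """
    x_gen = generate(x_ref, mask)
    for _ in range(rounds):
        x_in = color_transfer(x_gen, x_ref)
        x_gen = generate(x_in, mask)
    return x_gen
```

Applying color transfer at the start of each round is what keeps the style reference in play: without it, repeated generation would start from the previous output alone and could drift away from the reference colors. A practical version would likely also clip the transferred image to the valid pixel range before feeding it back in.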