IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis
Feng Liu, Xiaobin Chang
TL;DR
The paper addresses semantic image synthesis conditioned on a segmentation mask and a style reference image, and argues that GAN-based approaches struggle to satisfy both conditions jointly. It introduces IIDM, a latent-diffusion image-to-image model: the style reference $\mathbf{X}_R$ is encoded as $\mathbf{z}_0=\operatorname{E}(\mathbf{X}_R)$, contaminated with noise to obtain $\mathbf{z}_T$, and then progressively denoised under segmentation guidance to an estimate $\hat{\mathbf{z}}_0$, which is decoded into the generated image $\mathbf{X}_G$. Three plug-in inference modules (Refinement, Color Transfer, and Model Ensemble) are proposed to boost image quality and style fidelity without retraining. Evaluations on a large landscape dataset show that IIDM achieves state-of-the-art results in mask accuracy, FID, and style similarity and wins competitive benchmarks, underscoring the practicality of diffusion-based semantic synthesis with lightweight inference-time steps.
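To make the "noise-contaminated style latent" concrete, the following is a minimal sketch of how a style latent could be noised before denoising begins. It assumes the standard DDPM forward process with a linear beta schedule, which this summary does not confirm. The function name `noise_style_latent`, the latent shape, and the schedule values are all hypothetical, and a random array stands in for the encoder output.

```python
import numpy as np

def noise_style_latent(z0, t, betas, rng):
    """Sample a noised latent from z0 at step index t (standard DDPM forward process, assumed)."""
    alpha_bar = np.cumprod(1.0 - betas)      # fraction of signal variance kept up to each step
    eps = rng.standard_normal(z0.shape)      # Gaussian noise with the same shape as the latent
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 32, 32))        # stand-in for the encoded style reference (hypothetical shape)
betas = np.linspace(1e-4, 2e-2, 1000)        # assumed linear noise schedule, not taken from the paper
zT = noise_style_latent(z0, t=999, betas=betas, rng=rng)
```

Under this formulation, the step index controls how much of the style latent survives: a small index leaves the latent almost intact, while the final index leaves it close to pure Gaussian noise.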
Abstract
Semantic image synthesis aims to generate high-quality images given semantic conditions, i.e., segmentation masks and style reference images. Existing methods widely adopt generative adversarial networks (GANs), which take all conditional inputs and directly synthesize images in a single forward step. In this paper, semantic image synthesis is treated as an image denoising task and is handled with a novel image-to-image diffusion model (IIDM). Specifically, the style reference is first contaminated with random noise and then progressively denoised by IIDM, guided by segmentation masks. Moreover, three techniques, namely refinement, color transfer, and model ensemble, are proposed to further boost the generation quality. They are plug-in inference modules and do not require additional training. Extensive experiments show that our IIDM outperforms existing state-of-the-art methods by clear margins. Further analysis is provided via detailed demonstrations. We have implemented IIDM based on the Jittor framework; code is available at https://github.com/ader47/jittor-jieke-semantic_images_synthesis.
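As an illustration of what a training-free color-transfer step can look like, the sketch below matches the per-channel mean and standard deviation of a generated image to those of the style reference. The abstract does not say how IIDM performs color transfer, so this statistic-matching approach (in the spirit of Reinhard-style transfer, applied here directly in RGB for simplicity) is an assumption, and `match_color_stats` is a hypothetical helper name. Random arrays stand in for real images.

```python
import numpy as np

def match_color_stats(generated, reference, eps=1e-6):
    """Shift and scale each channel of `generated` to match the channel statistics of `reference`."""
    gen = generated.astype(np.float64)
    ref = reference.astype(np.float64)
    # Per-channel statistics over the spatial axes (images are H x W x C).
    g_mean, g_std = gen.mean(axis=(0, 1)), gen.std(axis=(0, 1))
    r_mean, r_std = ref.mean(axis=(0, 1)), ref.std(axis=(0, 1))
    out = (gen - g_mean) / (g_std + eps) * r_std + r_mean
    return np.clip(out, 0.0, 255.0)          # keep values in the valid 8-bit intensity range

rng = np.random.default_rng(0)
generated = rng.uniform(0, 255, size=(64, 64, 3))   # stand-in for a generated image
reference = rng.uniform(50, 150, size=(48, 48, 3))  # stand-in for the style reference
result = match_color_stats(generated, reference)
```

Because the step only uses image statistics, it runs after sampling and needs no access to model weights, which is consistent with the plug-in, no-retraining property described above.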
