Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

Yumeng Li, Margret Keuper, Dan Zhang, Anna Khoreva

TL;DR

This work tackles the challenge of layout-to-image diffusion synthesis by introducing ALDM, which adds a segmentation-based adversarial discriminator to explicitly enforce per-pixel layout alignment during diffusion training. A multistep unrolling strategy is proposed to promote consistency of layout adherence across the sampling horizon, with the training objective combining the standard denoising loss $L_{noise}$ and an adversarial term $\lambda_{adv} L_{adv}$ as $L_{DM} = L_{noise} + \lambda_{adv} L_{adv}$. Empirically, ALDM achieves strong layout faithfulness while preserving text editability, outperforming or matching competitive baselines on ADE20K and Cityscapes, and it yields substantial domain-generalization gains for semantic segmentation when used to augment training data (roughly 12 mIoU points). The approach offers a practical pathway to generate faithful, editable, and diverse synthetic data for real-world tasks such as autonomous driving, where robust generalization across domains is critical.
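
As a rough illustration of how the two terms in the objective could be combined, here is a minimal pure-Python sketch. It assumes that the denoising term is a mean squared error on the predicted noise and that the adversarial term is a per-pixel cross-entropy between the discriminator's class scores for the denoised image and the layout labels. Neither form is stated in this summary. The function names and the default weight `lambda_adv=0.1` are hypothetical.

```python
import math

def mse(pred, target):
    """Mean squared error over two equally sized flat lists (stand-in for L_noise)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pixel_cross_entropy(logits, labels):
    """Mean per-pixel cross-entropy (assumed form of L_adv, not taken from the paper).

    logits: one list of class scores per pixel.
    labels: one layout class index per pixel.
    """
    total = 0.0
    for scores, label in zip(logits, labels):
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
    return total / len(labels)

def combined_loss(noise_pred, noise_true, seg_logits, layout_labels, lambda_adv=0.1):
    """Return (L_DM, L_noise, L_adv) with L_DM = L_noise + lambda_adv * L_adv.

    lambda_adv=0.1 is a hypothetical default chosen only for illustration.
    """
    l_noise = mse(noise_pred, noise_true)
    l_adv = pixel_cross_entropy(seg_logits, layout_labels)
    return l_noise + lambda_adv * l_adv, l_noise, l_adv
```

In an actual training loop the inputs would be tensors produced by the diffusion model and the segmentation-based discriminator; plain lists are used here only to keep the sketch self-contained.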

Abstract

Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).
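
The multistep unrolling idea described in the abstract can be sketched schematically as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `denoise_step`, `predict_clean`, and `alignment_loss` are hypothetical callables standing in for one sampling step of the diffusion model, its denoised-image estimate, and the discriminator's layout-alignment loss. Averaging the loss over the window is also an assumption.

```python
def unrolled_alignment_loss(x_t, t, num_unroll, denoise_step, predict_clean, alignment_loss):
    """Average an alignment loss over up to `num_unroll` consecutive denoising steps.

    Starting from the noisy sample x_t at timestep t, each iteration:
      1. estimates the denoised image at the current step,
      2. scores its alignment with the layout,
      3. takes one denoising step, whose output feeds the next iteration.
    """
    if t < 1 or num_unroll < 1:
        raise ValueError("t and num_unroll must both be at least 1")
    losses = []
    x = x_t
    for k in range(num_unroll):
        step = t - k
        if step < 1:  # ran out of timesteps before the window ended
            break
        x0_hat = predict_clean(x, step)
        losses.append(alignment_loss(x0_hat))
        x = denoise_step(x, step)  # recursive reuse of the model's own output
    return sum(losses) / len(losses)

# Toy stand-ins on scalars, purely to make the sketch executable.
toy_loss = unrolled_alignment_loss(
    x_t=8.0,
    t=3,
    num_unroll=3,
    denoise_step=lambda x, step: x * 0.5,
    predict_clean=lambda x, step: x,
    alignment_loss=lambda x0: abs(x0 - 1.0),
)
```

The point of the loop is that later iterations see inputs produced by earlier denoising steps rather than freshly noised ground-truth images, which imitates what happens at inference time.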

Paper Structure (26 sections, 9 equations, 17 figures, 8 tables)

Figures (17)

  • Figure 1: In contrast to prior L2I synthesis methods (xue2023freestyle; zhang2023controlnet), our ALDM model can synthesize faithful samples that are well aligned with the layout input, while preserving controllability via text prompt. Equipped with both of these valuable properties, we can synthesize diverse samples of practical utility for downstream tasks, such as data augmentation for improving domain generalization of semantic segmentation models.
  • Figure 2: Method overview. To enforce faithfulness, we propose two novel training strategies to improve the traditional L2I diffusion model training (area (A)): adversarial supervision via a segmenter-based discriminator illustrated in area (B), and multistep unrolling strategy in area (C).
  • Figure 3: Qualitative comparison of faithfulness to the layout condition on ADE20K.
  • Figure 4: Visual comparison of text control between different L2I diffusion models on Cityscapes. Based on the image caption, we directly modify the underlined objects (indicated as $\rightarrow$), or append a postfix to the caption (indicated as +). In contrast to prior work, ALDM can faithfully accomplish both global scene level modification (e.g., "snowy scene") and local editing (e.g., "burning van").
  • Figure 5: Semantic segmentation results of Cityscapes $\rightarrow$ ACDC generalization using HRNet. The HRNet is trained on Cityscapes only. Augmented with diverse synthetic data generated by our ALDM, the segmentation model can make more reliable predictions under diverse conditions.
  • ...and 12 more figures