STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation

Ruyu Wang, Xuefeng Hou, Sabrina Schmedding, Marco F. Huber

TL;DR

This work tackles layout-to-image synthesis by introducing STAY Diffusion, a diffusion-based model that leverages global per-layout object representations and self-supervised masks to achieve fine-grained control over multiple objects. Two novel components, Edge-Aware Normalization (EAN) and Styled-Mask Attention (SMA), fuse layout information into the decoder and focus attention on relevant object regions, enabling diverse, accurate, and controllable image generation. The model is trained with standard diffusion objectives and uses classifier-free guidance during sampling, while producing self-supervised semantic maps that augment conditioning. Experiments on COCO-Stuff and Visual Genome show that STAY Diffusion advances the state of the art in diversity, accuracy, and controllability, and ablations confirm the contributions of EAN and SMA; mask-clarity analyses highlight the importance of high-quality object maps for conditional generation. The approach offers practical benefits for domain-specific editing and data augmentation where precise object control is essential, without relying on pretrained LTGMs.
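As a rough illustration of the classifier-free guidance step mentioned above, the sketch below combines a conditional and an unconditional noise prediction. It uses the common formulation in which a scale of 1 recovers the conditional prediction; the function name, the toy arrays, and the scale value are assumptions for illustration and are not taken from the paper, which may use a different parameterization.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Blend two noise predictions (hypothetical helper, not the paper's code).

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    larger values extrapolate further toward the condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy values standing in for two forward passes of a denoising network.
eps_cond = np.array([1.0, 2.0])    # prediction given the layout
eps_uncond = np.array([0.5, 1.0])  # prediction with the layout dropped
guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=2.0)
print(guided)
```

In a real sampler this blend would be applied at every denoising step, with the layout condition replaced by a null condition for the unconditional pass.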

Abstract

In layout-to-image (L2I) synthesis, controlled complex scenes are generated from coarse information like bounding boxes. Such a task is attractive for many downstream applications because the input layouts offer strong guidance to the generation process while remaining easily reconfigurable by humans. In this paper, we propose STyled LAYout Diffusion (STAY Diffusion), a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout, and a self-supervised semantic map for weight modulation using a novel Edge-Aware Normalization (EA Norm). A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image features for capturing the objects' relationships. These measures provide consistent guidance through the model, enabling more accurate and controllable image generation. Extensive benchmarking demonstrates that STAY Diffusion produces high-quality images while surpassing previous state-of-the-art methods in generation diversity, accuracy, and controllability.

Paper Structure

This paper contains 24 sections, 10 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: STAY Diffusion can produce high-quality images based on given bounding box layouts. (a) A comparison between STAY Diffusion and previous SOTA methods. (b) In addition, STAY Diffusion learns a self-supervised semantic map for each layout and can manipulate object appearance by resampling its associated latent code $z^\mathrm{sty}$ (cf. the towers of the building).
  • Figure 2: The architecture of the proposed STAY Diffusion. Through the guidance of the given layout, the learnable object representations $o^\mathrm{sty}$, and their respective initial masks $M^0$, the model gradually turns a noisy image into a realistic real-world scene.
  • Figure 3: The workflow of EA Norm, where object masks $M^j$ are carefully assembled into a pixel-wise weighted map $W^j$ and then extended by $o^\mathrm{sty}$ for computing the modulation parameters $\gamma$ and $\beta$. Note that in $m^\mathrm{non}_4$ and $m^\mathrm{non}_5$ (highlighted in green), some pixels are removed (i.e., set to 0) due to overlapping with other smaller objects. See text for more details.
  • Figure 4: Qualitative comparison to the SOTA methods on COCO-Stuff 256 $\times$ 256. STAY Diffusion shows better controllability and object recognizability than previous methods (e.g., the people on the playfield in the first row and the foggy effect in the last row). Zoom in for a better view.
  • Figure 5: Given the same layout and input noise, STAY Diffusion can manipulate the appearance of an object (cf. the rock) by resampling its associated $z^\mathrm{sty}$ (zoom in for a better view).
  • ...and 12 more figures
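
The Figure 3 caption notes that pixels of a mask are set to 0 where the object overlaps with smaller objects. A minimal sketch of that overlap rule follows, assuming binary masks and assuming that the smallest object covering a pixel keeps it. The function name `non_overlapping_masks` is hypothetical, and the paper's actual assembly of the weighted map $W^j$ may differ from this simplification.

```python
import numpy as np

def non_overlapping_masks(masks):
    """Resolve overlaps so each pixel belongs to the smallest covering object.

    masks: array of shape (N, H, W) with binary entries.
    Returns a float32 array of the same shape in which overlapping pixels
    have been zeroed out of the larger masks. (Illustrative assumption only.)
    """
    masks = np.asarray(masks, dtype=bool)
    areas = masks.reshape(len(masks), -1).sum(axis=1)
    order = np.argsort(areas, kind="stable")  # visit smallest objects first
    occupied = np.zeros(masks.shape[1:], dtype=bool)
    out = np.zeros_like(masks)
    for idx in order:
        out[idx] = masks[idx] & ~occupied  # drop pixels already claimed
        occupied |= masks[idx]
    return out.astype(np.float32)

# Toy layout: a full-frame object and a small 2x2 object inside it.
big = np.ones((4, 4))
small = np.zeros((4, 4))
small[1:3, 1:3] = 1
result = non_overlapping_masks(np.stack([big, small]))
print(result[0])  # the large mask with a hole where the small object sits
```

Processing objects from smallest to largest keeps small objects from being swallowed by large background regions, which matches the behavior the caption describes for the highlighted masks.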