MALeR: Improving Compositional Fidelity in Layout-Guided Generation

Shivank Saxena, Dhruv Srivastava, Makarand Tapaswi

TL;DR

MALeR targets the core challenge of compositional fidelity in layout-guided text-to-image generation, where multiple subjects and attributes must align with user-specified layouts. It introduces a training-free framework with three key components: masked latent regularization to suppress background leakage, in-distribution latent alignment to prevent out-of-distribution artifacts, and a novel subject-attribute association loss to ensure correct binding across complex scenes. Through extensive experiments on DrawBench and HRS, MALeR demonstrates superior compositional accuracy, generation consistency, and attribute binding, supported by quantitative metrics and user studies. The approach yields high-fidelity, layout-consistent images across multi-subject prompts and multiple attributes per subject, offering practical benefits for controllable image synthesis in real-world applications.

Abstract

Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. Several layout-guided methods have been proposed to give users more control over subject placement, but they struggle in compositional scenes: unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, and attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while keeping the generated images in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluations demonstrate that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
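
To make the in-distribution requirement concrete, here is a minimal sketch, not the authors' implementation. It assumes the reference distribution is a standard normal and that the latent is summarized by one global mean and variance; the function name `gaussian_kl_to_standard_normal` is hypothetical. It uses the closed-form KL divergence between N(mu, sigma^2) and N(0, 1), which grows as the latent statistics drift away from the reference.

```python
import numpy as np


def gaussian_kl_to_standard_normal(latent, eps=1e-8):
    """KL( N(mu, var) || N(0, 1) ) using the latent's global mean and variance.

    Assumption: a single Gaussian fitted to all latent entries is compared
    against a standard normal. The paper's actual statistic may differ.
    """
    mu = latent.mean()
    var = latent.var() + eps  # eps keeps log() finite for a constant latent
    return float(0.5 * (var + mu ** 2 - 1.0 - np.log(var)))


rng = np.random.default_rng(0)
z = rng.standard_normal((4, 64, 64))  # latent that matches the reference
drifted = 1.5 * z + 0.5               # latent pushed off-distribution

kl_in = gaussian_kl_to_standard_normal(z)
kl_out = gaussian_kl_to_standard_normal(drifted)
print(f"in-distribution KL: {kl_in:.4f}, drifted KL: {kl_out:.4f}")
```

A penalty of this kind is near zero for a latent that still looks like the initial noise and increases once repeated optimization steps shift or rescale it, which is the behaviour a regularizer against out-of-distribution artifacts needs.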

Paper Structure

This paper contains 32 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: MALeR (SDXL) outperforms Bounded Attention (BA) [dahary2024yourself] on complex prompts with multiple subjects and attributes. Both BA and MALeR use the same seed. From left to right, the prompts are: 1. A realistic photo of a brown wooden chicken and a gray metallic dog. 2. A realistic photo of a blue crystal bear and a brown wooden cat and a yellow fluffy dog. 3. A realistic photo of a shiny red crystal cat and a black matte plastic dog and a rustic bronze eagle and a glowing amber cat. 4. A round pizza and a square pizza and a triangle pizza. 5. A realistic photo of two red glass sphere and a blue glass sphere and two green glass sphere and a yellow glass sphere and a white glass sphere. 6. A black and white concept art of a crashed spaceship partially buried in icy landscape and a red hooded person is watching it from a distance. 7. A black and white concept art of a destroyed apocalyptic city covered with snow and a decaying teddy bear on a bench with four red balloons tied.
  • Figure 2: MALeR overview. We illustrate three key components for layout-guided compositional scene generation. (a) Masked latent regularization prevents background semantic leakage, (b) KL-based alignment keeps the latent in-distribution during optimization, and (c) layout-guided subject-attribute association enables accurate compositional binding.
  • Figure 3: We compare generated images on DrawBench. Each row uses the same random seed. The text prompt is shown above each row and the layout in column 1. We compare MALeR (SD) against Attention Refocusing [phung2024grounded], BoxDiff [xie2023boxdiff], Layout Guidance [chen2024training], and R&B [xiao2024rnb]. Bounded Attention (BA) [dahary2024yourself] with SDXL is compared against MALeR (SDXL). MALeR shows strong adherence to the prompt (subjects, attributes, and layout), generates high-quality images without background semantic leakage, and correctly localizes the subjects.
  • Figure 4: Color variation across the wizard and the thunderbolt for the prompt: A concept art of an icy landscape with a {red | black} robe wizard summoning a {pink | green | purple | blue} colored magic thunderbolt from air. All images are generated with the same random seed.
  • Figure 5: Qualitative ablation on the impact of the loss terms in the final loss (Eq. \ref{eq:final_loss}). From left to right, the columns are: (a) layout prompt, (b) output of $\mathcal{L}_\text{iou}$ only, (c) effect of masked latent regularization ($\mathcal{L}_\text{iou} + \mathcal{L}_\text{mask}$), and (d) together with KL regularization ($\mathcal{L}_\text{iou} + \mathcal{L}_\text{mask} + \mathcal{L}_\text{KL}$). Next, we present the impact of adding attribute losses to (d): (e) similarity loss ($+\mathcal{L}_\text{sim}$), (f) dissimilarity loss ($+\mathcal{L}_\text{dis}$), and (g) full attribute loss ($+\mathcal{L}_\text{att}$), which includes all loss terms and corresponds to our final approach, MALeR. All images are generated using seed 0.
  • ...and 3 more figures
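
The Figure 5 caption stacks the loss terms cumulatively, which suggests one possible reading of the overall objective. The form below is a sketch under stated assumptions, not the paper's final loss equation: it assumes the terms are combined as a weighted sum, that $\mathcal{L}_\text{att}$ is built from $\mathcal{L}_\text{sim}$ and $\mathcal{L}_\text{dis}$, and the weights $\lambda_\text{mask}$, $\lambda_\text{KL}$, and $\lambda_\text{att}$ are hypothetical symbols that do not appear in the text above.

$$
\mathcal{L} \;=\; \mathcal{L}_\text{iou} \;+\; \lambda_\text{mask}\,\mathcal{L}_\text{mask} \;+\; \lambda_\text{KL}\,\mathcal{L}_\text{KL} \;+\; \lambda_\text{att}\,\mathcal{L}_\text{att},
\qquad
\mathcal{L}_\text{att} \;=\; \mathcal{L}_\text{sim} + \mathcal{L}_\text{dis}
$$

Under this reading, column (b) corresponds to $\lambda_\text{mask} = \lambda_\text{KL} = \lambda_\text{att} = 0$, columns (c) and (d) switch on the mask and KL terms in turn, and column (g) uses every term.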