MALeR: Improving Compositional Fidelity in Layout-Guided Generation
Shivank Saxena, Dhruv Srivastava, Makarand Tapaswi
TL;DR
MALeR targets the core challenge of compositional fidelity in layout-guided text-to-image generation, where multiple subjects and attributes must align with user-specified layouts. It introduces a training-free framework with three key components: masked latent regularization to suppress background leakage, in-distribution latent alignment to prevent out-of-distribution artifacts, and a novel subject-attribute association loss to ensure correct binding across complex scenes. Through extensive experiments on DrawBench and HRS, MALeR demonstrates superior compositional accuracy, generation consistency, and attribute binding, supported by quantitative metrics and user studies. The approach yields high-fidelity, layout-consistent images across multi-subject prompts and multiple attributes per subject, offering practical benefits for controllable image synthesis in real-world applications.
Abstract
Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. These methods still face numerous challenges, particularly in compositional scenes: unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while keeping the generated image in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluations demonstrate that our method outperforms previous work in compositional accuracy, generation consistency, and attribute binding. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
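
To make the idea of keeping subjects inside the given layouts more concrete, the following toy sketch rasterizes layout boxes into a binary mask and measures how much latent energy falls outside every box. It is only an illustration under stated assumptions, not the loss used by MALeR, which this section does not specify: the function names `boxes_to_mask` and `background_penalty`, the normalized `(x0, y0, x1, y1)` box format, the `(C, H, W)` latent shape, and the squared-magnitude penalty are all hypothetical choices made for this example.

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Rasterize normalized (x0, y0, x1, y1) boxes into a binary (H, W) mask.

    Assumption: boxes use coordinates in [0, 1]; cells inside any box are 1.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        r0, r1 = int(round(y0 * height)), int(round(y1 * height))
        c0, c1 = int(round(x0 * width)), int(round(x1 * width))
        mask[r0:r1, c0:c1] = 1.0
    return mask

def background_penalty(latent, mask):
    """Mean squared latent value over cells that lie outside every box.

    Assumption: `latent` has shape (C, H, W). This is a stand-in penalty
    for illustration only, not the regularizer defined in the paper.
    """
    background = 1.0 - mask
    n_bg = background.sum() * latent.shape[0]
    if n_bg == 0:
        return 0.0  # the boxes cover the whole latent grid
    return float(((latent ** 2) * background[None]).sum() / n_bg)

# Hypothetical example: one box over the top-left quadrant of an 8x8 grid.
latent = np.ones((4, 8, 8), dtype=np.float32)
mask = boxes_to_mask([(0.0, 0.0, 0.5, 0.5)], 8, 8)
penalty = background_penalty(latent, mask)
```

In a layout-guided setting, a quantity of this kind could in principle be minimized with respect to the latent during sampling; how MALeR actually formulates and applies its masked regularization is described in the method section of the paper rather than here.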
