All-in-One Conditioning for Text-to-Image Synthesis
Hirunima Jayasekara, Chuong Huynh, Yixuan Ren, Christabel Acquaye, Abhinav Shrivastava
TL;DR
This work introduces ASQL, a zero-shot, scene-graph–based conditioning mechanism for text-to-image diffusion that generates soft visual guidance at inference time. By leveraging a lightweight LLM to output Attribute-Size-Quantity-Location constraints and integrating them through a differentiable ASQL loss, the method improves semantic fidelity and spatial coherence for complex prompts without requiring task-specific training. The approach combines soft region guidance, fuzzy grid placement, and multiple attention-based losses to enforce accurate object counts, sizes, attributes, and spatial relations, achieving state-of-the-art results on several benchmarks. The work demonstrates substantial improvements in accuracy and diversity while maintaining efficiency, with ablations highlighting the necessity of each component and outlining directions for future generalization and refinement.
Abstract
Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships remains a critical challenge in text-to-image synthesis. Despite recent advances in generating photorealistic outputs, current models often struggle to maintain semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis in scene graph structures, aiming to enhance the compositional abilities of existing models. Although prior approaches have attempted to address this problem by using predefined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
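To make the idea of soft visual guidance more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a hypothetical `ASQLCondition` record holding one object's attribute, size, quantity, and location, a Gaussian-shaped soft mask on a coarse grid, and a simple loss that penalizes attention mass falling outside that mask. All names, the Gaussian form, and the loss definition are assumptions made for illustration; the abstract does not specify them, and an actual inference-time optimization would use differentiable tensors rather than plain Python lists.

```python
import math
from dataclasses import dataclass


@dataclass
class ASQLCondition:
    """Hypothetical container for one object's ASQL constraints."""
    name: str
    attribute: str
    size: float        # assumed: fraction of the image side, in (0, 1]
    quantity: int
    location: tuple    # assumed: (row, col) centre in normalized [0, 1] coordinates


def soft_region_mask(cond, grid=8):
    """Build a grid x grid soft mask that decays smoothly away from the
    requested location instead of using a hard bounding box."""
    cy, cx = cond.location
    sigma = max(cond.size / 2.0, 1e-3)  # wider mask for larger objects
    mask = []
    for i in range(grid):
        y = (i + 0.5) / grid  # centre of grid cell i along the vertical axis
        row = []
        for j in range(grid):
            x = (j + 0.5) / grid
            d2 = (y - cy) ** 2 + (x - cx) ** 2
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        mask.append(row)
    return mask


def region_loss(attention, mask):
    """Return 1 minus the mask-weighted share of attention mass.
    The value is low when attention concentrates where the mask is high."""
    total = sum(sum(row) for row in attention)
    if total <= 0:
        return 1.0
    inside = sum(a * m
                 for row_a, row_m in zip(attention, mask)
                 for a, m in zip(row_a, row_m))
    return 1.0 - inside / total


# Usage: compare an attention map focused near the target with one far from it.
cond = ASQLCondition("apple", "red", size=0.3, quantity=1, location=(0.25, 0.25))
mask = soft_region_mask(cond, grid=8)
near = [[1.0 if (i, j) == (1, 1) else 0.0 for j in range(8)] for i in range(8)]
far = [[1.0 if (i, j) == (7, 7) else 0.0 for j in range(8)] for i in range(8)]
print(region_loss(near, mask), region_loss(far, mask))
```

In this sketch the soft mask tolerates small deviations from the requested position, which is one plausible way to read the contrast with rigid layout maps drawn in the abstract. Quantity and attribute constraints would need their own loss terms, which are not modelled here.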
