All-in-One Conditioning for Text-to-Image Synthesis

Hirunima Jayasekara, Chuong Huynh, Yixuan Ren, Christabel Acquaye, Abhinav Shrivastava

TL;DR

This work introduces ASQL, a zero-shot, scene-graph–based conditioning mechanism for text-to-image diffusion that generates soft visual guidance at inference time. By leveraging a lightweight LLM to output Attribute-Size-Quantity-Location constraints and integrating them through a differentiable ASQL loss, the method improves semantic fidelity and spatial coherence for complex prompts without requiring task-specific training. The approach combines soft region guidance, fuzzy grid placement, and multiple attention-based losses to enforce accurate object counts, sizes, attributes, and spatial relations, achieving state-of-the-art results on several benchmarks. The work demonstrates substantial improvements in accuracy and diversity while maintaining efficiency, with ablations highlighting the necessity of each component and outlining directions for future generalization and refinement.

Abstract

Accurately interpreting and visually representing complex prompts that involve multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advances in generating photorealistic outputs, current models often struggle to maintain semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis in scene graph structures, aiming to enhance the compositional abilities of existing models. Although prior approaches have attempted to address this problem by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.

Paper Structure

This paper contains 14 sections, 12 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Our ASQL Conditioning Pipeline. Given a text condition, we generate intermediate conditions to control the image denoising process at each time step. All properties (size, quantity, attributes) of each entity and the relationships between entities are considered to achieve the best result.
  • Figure 2: Qualitative results for the proposed method. First two columns: images generated by the baselines, SDv2.1 and Attend-n-Excite with an SDv2.1 base. Third column: images generated by the proposed method.
  • Figure 3: Qualitative results for the proposed method. We compare our results on two objects and position against Attend-n-Excite with an SDv2.1 base.
  • Figure 4: Example of soft region masking and the effect of the Dice loss and size loss. The first column shows the images generated by SD and Attend-n-Excite. The second column shows the image generated with the proposed pipeline.
  • Figure 5: Failed generations. First two columns: images generated in an overly literal manner. Last three columns: the entity is either absent or too complex to be fully captured by the text prompt.
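
Figure 4 mentions a soft region mask together with a Dice loss and a size loss, but this page does not give their definitions. As a rough illustration only, the sketch below assumes a Gaussian-shaped soft mask, a soft Dice loss with squared terms in the denominator, and a size loss that penalises the gap between the mean attention and a target area fraction. The function names, the Gaussian mask shape, and all parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def soft_region_mask(h, w, center, sigma):
    """Gaussian-shaped soft mask with values in (0, 1].

    `center` is (row, col) in normalised [0, 1] coordinates and `sigma`
    is the spread in the same units. Both are illustrative assumptions.
    """
    ys = (np.arange(h) + 0.5) / h
    xs = (np.arange(w) + 0.5) / w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    dist_sq = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    return np.exp(-dist_sq / (2.0 * sigma ** 2))


def dice_loss(attn, mask, eps=1e-8):
    """Soft Dice loss between an attention map and a soft mask.

    Squared terms in the denominator make the loss (nearly) zero when
    the two maps are identical, even though both are non-binary.
    """
    intersection = (attn * mask).sum()
    denom = (attn ** 2).sum() + (mask ** 2).sum() + eps
    return 1.0 - 2.0 * intersection / denom


def size_loss(attn, target_fraction):
    """Squared gap between mean attention and a target area fraction."""
    return (attn.mean() - target_fraction) ** 2


# Stand-in for a cross-attention map of one object token (hypothetical data).
attn_map = soft_region_mask(16, 16, center=(0.75, 0.75), sigma=0.1)
# Desired soft region for that object, e.g. the top-left of the image.
target = soft_region_mask(16, 16, center=(0.25, 0.25), sigma=0.1)

# A weighted sum like this could serve as an inference-time guidance signal;
# the 0.5 weight and the 0.1 target fraction are arbitrary.
total = dice_loss(attn_map, target) + 0.5 * size_loss(attn_map, 0.1)
```

In a diffusion pipeline, a loss of this kind would be computed on differentiable attention maps (for example with an autodiff framework) so that its gradient can nudge the latent at each denoising step. NumPy is used here only to keep the arithmetic easy to inspect.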