ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis

Andrea Rigo, Luca Stornaiuolo, Mauro Martino, Bruno Lepri, Nicu Sebe

TL;DR

ESPLoRA introduces a lightweight LoRA-based fine-tuning framework to enhance spatial precision in text-to-image diffusion models, supported by a spatially explicit Urban T2I Spatial Dataset and a geometry-grounded evaluation. The method includes TORE, a bias-aware prompt transformation, and a recursion-ready pipeline that extracts reliable 2D/3D spatial relationships from bounding boxes and depth maps to train on synthetic and natural data. Empirical results show strong gains over CoMPaSS on both 2D and 3D spatial benchmarks, with a ~13.33% improvement, while maintaining high-definition outputs at no extra inference cost. The work offers practical benefits for urban planning and design by enabling accurate spatial configurations in generated imagery, though it acknowledges limitations such as relation-specific biases and potential “cheating” via object duplication. A $\tau$-based strictness control in TORE further improves spatial fidelity by prioritizing high-performing relation variants without altering the underlying prompt meaning.

Abstract

Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. To address the lack of spatial information in T2I generations, existing methods typically use external network conditioning and predefined layouts, resulting in higher computational costs and reduced flexibility. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M to ensure precise alignment between textual descriptions and spatial layouts. Alongside this dataset, we present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, specifically designed to enhance spatial consistency in generative models without increasing generation time or compromising the quality of the outputs. In addition to ESPLoRA, we propose refined evaluation metrics grounded in geometric constraints, capturing 3D spatial relations such as "in front of" or "behind". These metrics also expose spatial biases in T2I models which, even when not fully mitigated, can be strategically exploited by our TORE algorithm to further improve the spatial consistency of generated images. Our method outperforms CoMPaSS, the current baseline framework, on spatial consistency benchmarks.
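
The idea of exploiting relation-specific biases through a meaning-preserving prompt transformation can be illustrated with a small sketch. This is not the authors' TORE implementation, which is not specified on this page. It assumes that a prompt is rewritten into its equivalent inverse form (for example, "A to the left of B" becomes "B to the right of A") only when the inverse relation's measured accuracy exceeds the original's by more than a threshold $\tau$. The function name `tore_like_rewrite`, the `INVERSE` table, the prompt template, and the accuracy values are all hypothetical.

```python
# Hypothetical sketch of a bias-aware prompt rewrite; NOT the paper's TORE code.
# Assumption: tau is the minimum accuracy gain required before rewriting.

# Each relation paired with its meaning-preserving inverse.
INVERSE = {
    "to the left of": "to the right of",
    "to the right of": "to the left of",
    "above": "below",
    "below": "above",
    "in front of": "behind",
    "behind": "in front of",
}


def tore_like_rewrite(subject, relation, obj, accuracy, tau=0.0):
    """Return a prompt using whichever equivalent relation the model handles better.

    accuracy maps each relation to a measured per-relation accuracy in [0, 1].
    The prompt is flipped only if the inverse relation wins by more than tau.
    """
    inverse = INVERSE[relation]
    if accuracy[inverse] - accuracy[relation] > tau:
        # Swap the two objects so the described scene stays the same.
        return f"a {obj} {inverse} a {subject}"
    return f"a {subject} {relation} a {obj}"


# Made-up accuracies, for illustration only (not results from the paper).
example_accuracy = {
    "to the left of": 0.40,
    "to the right of": 0.55,
    "above": 0.60,
    "below": 0.50,
    "in front of": 0.30,
    "behind": 0.25,
}

loose = tore_like_rewrite("car", "to the left of", "tree", example_accuracy, tau=0.1)
strict = tore_like_rewrite("car", "to the left of", "tree", example_accuracy, tau=0.2)
```

With these made-up numbers, the smaller threshold rewrites the prompt to use "to the right of", while the larger threshold leaves the original prompt unchanged, so a larger $\tau$ makes the transformation more conservative under this assumption.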

Paper Structure

This paper contains 17 sections, 6 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: ESPLoRA enables existing T2I diffusion models to generate challenging spatial configurations, enhancing spatial capabilities without compromising output quality or increasing generation time.
  • Figure 2: Example of a 3D relationship captured by our metric. On the left, we check that the two objects overlap; on the right, we compute the average depth of each object and assign the correct 3D relationship accordingly.
  • Figure 3: Ablation study on training set size with simple (1 rel.) prompts tested on both simple and complex (2 rel.) prompts containing the right relationship. Optimal training set size is around 1800 samples.
  • Figure 4: Strict accuracy for all models, trained on synthetic images. Fine-tuning using all relationships and complex (2 rel.) prompts outperforms all other methods.
  • Figure 5: Comparison of soft accuracy on natural and synthetic images for Front and Between. Fine-tuning with synthetic images outperforms all other training configurations.
  • ...and 14 more figures
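
The two-step check described in the Figure 2 caption (overlap test, then average-depth comparison) can be sketched as follows. This is a minimal illustration, not the paper's metric code. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` in pixel coordinates, a depth map in which smaller values mean closer to the camera, and depth averaged over the whole box. The function names are hypothetical.

```python
import numpy as np


def boxes_overlap(box_a, box_b):
    """Return True if two (x1, y1, x2, y2) boxes share a non-empty area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return min(ax2, bx2) > max(ax1, bx1) and min(ay2, by2) > max(ay1, by1)


def mean_depth(depth, box):
    """Average depth inside a box; depth is an (H, W) array indexed [row, col]."""
    x1, y1, x2, y2 = box
    return float(depth[y1:y2, x1:x2].mean())


def relation_3d(depth, box_a, box_b):
    """Classify object A relative to object B as 'in front of' or 'behind'.

    Returns None when the boxes do not overlap or the mean depths are equal.
    Assumption: smaller depth values are closer to the camera.
    """
    if not boxes_overlap(box_a, box_b):
        return None
    depth_a = mean_depth(depth, box_a)
    depth_b = mean_depth(depth, box_b)
    if depth_a == depth_b:
        return None
    return "in front of" if depth_a < depth_b else "behind"


# Toy scene: background at depth 5.0, with a nearer object at depth 2.0.
depth_map = np.full((8, 8), 5.0)
depth_map[1:5, 1:5] = 2.0
near_box = (1, 1, 5, 5)   # covers the near object exactly
far_box = (3, 3, 7, 7)    # overlaps it, but is mostly background

result_near = relation_3d(depth_map, near_box, far_box)
result_far = relation_3d(depth_map, far_box, near_box)
result_apart = relation_3d(depth_map, (0, 0, 2, 2), (4, 4, 6, 6))
```

In the toy scene, the first box has a lower average depth than the second, so it is classified as "in front of"; swapping the arguments gives "behind", and two boxes that do not overlap yield no 3D relation.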