ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis
Andrea Rigo, Luca Stornaiuolo, Mauro Martino, Bruno Lepri, Nicu Sebe
TL;DR
ESPLoRA introduces a lightweight LoRA-based fine-tuning framework that improves spatial precision in text-to-image diffusion models, supported by a spatially explicit Urban T2I Spatial Dataset and a geometry-grounded evaluation. The method includes TORE, a bias-aware prompt transformation, and a recursion-ready pipeline that extracts reliable 2D/3D spatial relationships from bounding boxes and depth maps for training on synthetic and natural data. Empirical results show strong gains over CoMPaSS on both 2D and 3D spatial benchmarks, with a ~13.33% improvement, while maintaining high-definition outputs and adding no extra inference cost. The work offers practical benefits for urban planning and design by enabling accurate spatial configurations in generated imagery, though the authors acknowledge limitations such as relation-specific biases and potential “cheating” via object duplication. The $\tau$-based strictness control in TORE further improves spatial fidelity by prioritizing high-performing relation variants without altering the underlying prompt meaning.
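The summary does not spell out how TORE works, so the following is only one plausible reading, sketched under stated assumptions: a prompt such as "A left of B" is swapped for its meaning-preserving inverse "B right of A" when the model is measurably better at the inverse relation, and $\tau$ is read as the minimum accuracy gap needed to trigger the swap. The function name `tore_rewrite`, the `INVERSE` table, and the accuracy numbers are all hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of a bias-aware prompt rewrite; not the authors' implementation.

# Each relation paired with the inverse that expresses the same scene
# once subject and object are swapped.
INVERSE = {
    "left of": "right of",
    "right of": "left of",
    "above": "below",
    "below": "above",
    "in front of": "behind",
    "behind": "in front of",
}


def tore_rewrite(subject, relation, obj, accuracy, tau=0.0):
    """Return a (subject, relation, object) triple with the same meaning.

    `accuracy` maps each relation to a measured per-relation score for the
    model (assumed to come from a prior evaluation run). The triple is
    flipped only when the inverse relation beats the original by more than
    `tau`, so a larger `tau` means fewer rewrites.
    """
    inverse = INVERSE[relation]
    if accuracy[inverse] - accuracy[relation] > tau:
        return obj, inverse, subject
    return subject, relation, obj


# Made-up scores for illustration only.
example_accuracy = {
    "left of": 0.4, "right of": 0.7,
    "above": 0.8, "below": 0.6,
    "in front of": 0.5, "behind": 0.5,
}

flipped = tore_rewrite("a car", "left of", "a tree", example_accuracy, tau=0.1)
kept = tore_rewrite("a car", "left of", "a tree", example_accuracy, tau=0.5)
```

With the made-up scores above, the first call flips the prompt to the "right of" phrasing, while the stricter threshold in the second call leaves it unchanged.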
Abstract
Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. To address the lack of spatial information in T2I generations, existing methods typically use external network conditioning and predefined layouts, resulting in higher computational costs and reduced flexibility. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M to ensure precise alignment between textual descriptions and spatial layouts. Alongside this dataset, we present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, specifically designed to enhance spatial consistency in generative models without increasing generation time or compromising the quality of the outputs. In addition to ESPLoRA, we propose refined evaluation metrics grounded in geometric constraints, capturing 3D spatial relations such as "in front of" or "behind". These metrics also expose spatial biases in T2I models which, even when not fully mitigated, can be strategically exploited by our TORE algorithm to further improve the spatial consistency of generated images. Our method outperforms CoMPaSS, the current baseline framework, on spatial consistency benchmarks.
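To make the idea of geometry-grounded spatial relations concrete, here is a minimal sketch of how 2D relations could be read from bounding boxes and 3D relations from a depth map. It is an illustration under assumptions, not the paper's metric: the `Box` type, the no-overlap rule, the use of the median depth inside each box, the `margin` value, and the convention that smaller depth values are closer to the camera are all choices made here for the example.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Set


@dataclass(frozen=True)
class Box:
    """Axis-aligned pixel box; x_max and y_max are exclusive."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def relations_2d(a: Box, b: Box) -> Set[str]:
    """Relations of `a` with respect to `b`, using image coordinates (y grows downward).

    A relation is reported only when the boxes do not overlap along that axis
    (an assumed strictness rule).
    """
    rels = set()
    if a.x_max <= b.x_min:
        rels.add("left of")
    if a.x_min >= b.x_max:
        rels.add("right of")
    if a.y_max <= b.y_min:
        rels.add("above")
    if a.y_min >= b.y_max:
        rels.add("below")
    return rels


def median_depth(depth: List[List[float]], box: Box) -> float:
    """Median depth over the pixels covered by `box`."""
    values = [
        depth[y][x]
        for y in range(box.y_min, box.y_max)
        for x in range(box.x_min, box.x_max)
    ]
    return median(values)


def relation_depth(depth: List[List[float]], a: Box, b: Box,
                   margin: float = 0.1) -> Optional[str]:
    """'in front of' / 'behind' for `a` relative to `b`, or None if too close to call.

    Assumes smaller depth means closer to the camera; flip the comparison
    for disparity-style maps.
    """
    da, db = median_depth(depth, a), median_depth(depth, b)
    if db - da > margin:
        return "in front of"
    if da - db > margin:
        return "behind"
    return None


# Toy 4x4 depth map: top-left region is near (1.0), the rest is far (3.0).
toy_depth = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [3.0, 3.0, 3.0, 3.0],
    [3.0, 3.0, 3.0, 3.0],
]
near_box = Box(0, 0, 2, 2)
far_box = Box(2, 2, 4, 4)

planar = relations_2d(near_box, far_box)
depthwise = relation_depth(toy_depth, near_box, far_box)
```

Using the median rather than the mean keeps a few background pixels inside a box from dominating the depth estimate, and the margin avoids labeling objects at nearly equal depth.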
