SpatialLock: Precise Spatial Control in Text-to-Image Synthesis

Biao Liu, Yuanzhi Liang

TL;DR

This work tackles the challenge of precise object localization in text-to-image synthesis by introducing SpatialLock, a diffusion-based framework that jointly leverages grounding information and perception supervision. It introduces two modules, Position-Engaged Injection (PoI) and Position-Guided Learning (PoG), where PoI uses a Grounding-Attention to inject spatial cues and PoG uses a perception network with Enh-Fusion to refine layouts. The approach achieves state-of-the-art localization (IOU-based metrics) and fidelity (FID) on MSCOCO and Flickr, while maintaining competitive inference efficiency and using fewer trainable parameters than many baselines. The results demonstrate that enriching input signals with explicit spatial and perceptual supervision enhances both layout accuracy and image quality, with practical benefits for dataset generation and downstream perception tasks; for example, the model achieves IOU scores above $0.9$ across multiple datasets and improves downstream detector performance when synthetic data is added.

Abstract

Text-to-Image (T2I) synthesis has made significant advancements in recent years, driving applications such as generating datasets automatically. However, precise control over object localization in generated images remains a challenge. Existing methods fail to fully utilize positional information, leading to an inadequate understanding of object spatial layouts. To address this issue, we propose SpatialLock, a novel framework that leverages perception signals and grounding information to jointly control the generation of spatial locations. SpatialLock incorporates two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial information through an attention layer, encouraging the model to learn the grounding information effectively. PoG employs perception-based supervision to further refine object localization. Together, these components enable the model to generate objects with precise spatial arrangements and improve the visual quality of the generated images. Experiments show that SpatialLock sets a new state-of-the-art for precise object positioning, achieving IOU scores above 0.9 across multiple datasets.
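The IOU (intersection over union) figure quoted above compares where an object was requested with where it actually appears. The abstract does not specify the evaluation pipeline, so the following is only a minimal sketch of the standard box-IOU definition, assuming axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form; the function name `box_iou` and the example coordinates are illustrative and not taken from the paper.

```python
def box_iou(box_a, box_b):
    """Standard IOU of two axis-aligned boxes (x_min, y_min, x_max, y_max).

    Illustrative helper; the paper's exact evaluation code is not shown
    in this summary.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


# Hypothetical example: requested box vs. box detected in the generated image.
requested = (100, 100, 300, 300)
detected = (110, 100, 300, 300)
print(box_iou(requested, detected))
```

In this made-up example the detected box is shifted by 10 pixels on one side of a 200-pixel box, which still leaves an IOU of 0.95; scores above 0.9 therefore correspond to boxes that almost coincide with the requested layout.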

Paper Structure

This paper contains 44 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Compared to prior methods, our method uses cross-attention and a perception module to ensure that location information is sufficiently encoded. Additionally, we incorporate Enh-Fusion into the perception module to further integrate global information.
  • Figure 2: SpatialLock incorporates PoI and PoG into the original Stable Diffusion framework. PoI learns grounding information directly through Grounding-Attention to generate a precise positional layout, and PoG provides additional supervision from perception information to refine object locations.
  • Figure 3: Compared to other advanced methods on MSCOCO, our model produces more realistic images with superior spatial layout, especially for small objects (e.g., the correctly placed baseball glove). Note that the performance of DetDiffusion is the result of our reproduction and may differ from the original.
  • Figure 4: The importance of text captions in generated images: they significantly influence the style, appearance, and layout of object entities.
  • Figure 5: The capability of our model to generate various scenes on the Flickr dataset, including spatial positioning, multi-object generation, and detailed human facial features.
  • ...and 3 more figures