SpatialLock: Precise Spatial Control in Text-to-Image Synthesis
Biao Liu, Yuanzhi Liang
TL;DR
This work tackles precise object localization in text-to-image synthesis by introducing SpatialLock, a diffusion-based framework that jointly leverages grounding information and perception supervision. It comprises two modules: Position-Engaged Injection (PoI), which uses a Grounding-Attention layer to inject spatial cues, and Position-Guided Learning (PoG), which uses a perception network with Enh-Fusion to refine layouts. The approach achieves state-of-the-art localization (IoU-based metrics) and fidelity (FID) on MSCOCO and Flickr, while maintaining competitive inference efficiency and using fewer trainable parameters than many baselines. The results indicate that enriching input signals with explicit spatial and perceptual supervision improves both layout accuracy and image quality, with practical benefits for dataset generation and downstream perception tasks. For example, the model achieves IoU scores above 0.9 across multiple datasets and improves downstream detector performance when synthetic data is added.
Abstract
Text-to-Image (T2I) synthesis has advanced significantly in recent years, driving applications such as automatic dataset generation. However, precise control over object localization in generated images remains a challenge. Existing methods fail to fully utilize positional information, leading to an inadequate understanding of object spatial layouts. To address this issue, we propose SpatialLock, a novel framework that leverages perception signals and grounding information to jointly control the generation of spatial locations. SpatialLock incorporates two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial information through an attention layer, encouraging the model to learn the grounding information effectively. PoG employs perception-based supervision to further refine object localization. Together, these components enable the model to generate objects with precise spatial arrangements and improve the visual quality of the generated images. Experiments show that SpatialLock sets a new state-of-the-art for precise object positioning, achieving IoU scores above 0.9 across multiple datasets.
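To make the headline metric concrete, the following is a minimal sketch of the standard intersection-over-union (IoU) computation for two axis-aligned bounding boxes. It is a generic illustration and not the paper's evaluation code: the function name `box_iou`, the `(x_min, y_min, x_max, y_max)` box format, and the example coordinates are assumptions made here, and how the boxes of generated objects are obtained is not described in this section.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    Hypothetical helper for illustration; the box format is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped to zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield an IoU of 0 rather than a division error.
    return inter / union if union > 0 else 0.0


# Made-up example: a requested layout box versus the box of the generated object.
requested = (10, 10, 110, 110)
generated = (12, 12, 110, 110)
score = box_iou(requested, generated)
```

Under this definition, a score above 0.9 means the generated object's box and the requested box overlap almost entirely; in the made-up example, a shift of only two pixels on two sides of a 100-pixel box already lowers the score to about 0.96.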
