GroundingBooth: Grounding Text-to-Image Customization
Zhexiao Xiong, Wei Xiong, Jing Shi, He Zhang, Yizhi Song, Nathan Jacobs
TL;DR
GroundingBooth addresses the lack of precise spatial control in text-to-image customization by introducing a grounding module and a subject-grounded cross-attention mechanism that jointly ground foreground subjects and background objects. The framework learns grounded embeddings from prompt-layout and object-layout signals and restricts attention to target regions, enabling zero-shot, instance-level layout control while preserving subject identity. Trained on diverse data and requiring no test-time fine-tuning, it supports multi-subject inference and demonstrates improved layout grounding, identity preservation, and text alignment across single-subject, multi-subject, and complex-layout scenarios, with ablations confirming the effectiveness of both core components. By reducing reliance on test-time fine-tuning, the work makes scalable, layout-aware customization more practical for fine-grained content creation.
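To make "restricts attention to target regions" concrete, the following is a minimal sketch of region-restricted cross-attention, not the authors' implementation. It assumes a normalized bounding box `(x0, y0, x1, y1)`, a flattened row-major latent grid of query vectors, and a small set of subject tokens as keys and values. The function names `box_to_mask` and `region_restricted_cross_attention` are hypothetical, and zeroing the update outside the box is one illustrative choice among several.

```python
import math


def box_to_mask(box, height, width):
    """Rasterize a normalized (x0, y0, x1, y1) box onto a flattened H x W grid."""
    x0, y0, x1, y1 = box
    mask = []
    for row in range(height):
        for col in range(width):
            # A cell belongs to the box if its centre falls inside it.
            cx = (col + 0.5) / width
            cy = (row + 0.5) / height
            mask.append(x0 <= cx < x1 and y0 <= cy < y1)
    return mask


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_restricted_cross_attention(queries, keys, values, mask):
    """Attend from each spatial query to the subject tokens, but only inside the mask.

    Assumption for illustration: locations outside the mask receive a zero update,
    so the subject cannot influence pixels outside its box.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for query, inside in zip(queries, mask):
        if not inside:
            outputs.append([0.0] * out_dim)
            continue
        # Scaled dot-product scores between this location and every subject token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


# Toy usage: a 4 x 4 latent grid with the subject placed in the top-left quadrant.
mask = box_to_mask((0.0, 0.0, 0.5, 0.5), 4, 4)
queries = [[1.0, 0.0] for _ in range(16)]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out = region_restricted_cross_attention(queries, keys, values, mask)
```

In this toy setup only the four grid cells in the top-left quadrant receive a non-zero update from the subject tokens. With several subjects, one such mask per subject would keep each identity confined to its own box.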
Abstract
Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks. The project page is available at https://groundingbooth.github.io.
