GroundingBooth: Grounding Text-to-Image Customization

Zhexiao Xiong, Wei Xiong, Jing Shi, He Zhang, Yizhi Song, Nathan Jacobs

TL;DR

GroundingBooth addresses the lack of precise spatial control in text-to-image customization by introducing a grounding module and a subject-grounded cross-attention mechanism that jointly ground foreground subjects and background objects. The framework learns grounding embeddings from prompt-layout and reference object-layout pairs and restricts attention to target regions, enabling zero-shot, instance-level layout control while preserving subject identity. Although trained with a single subject per image, it generalizes to multiple subjects at inference, and it demonstrates improved layout grounding, identity preservation, and text alignment across single-subject, multi-subject, and complex-layout scenarios, with ablations confirming the effectiveness of both core components. The work advances controllable image synthesis for fine-grained content creation, reducing reliance on test-time fine-tuning and enabling layout-aware customization.

Abstract

Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks. The project page is available at https://groundingbooth.github.io.

Paper Structure (33 sections, 6 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: We propose GroundingBooth, a framework for grounded text-to-image customization. GroundingBooth supports: (a) grounded single-subject customization, and (b) joint grounded customization for multi-subjects and text entities. GroundingBooth achieves prompt following, layout grounding for both subjects and background objects, and identity preservation of subjects simultaneously.
  • Figure 2: Inference pipeline of GroundingBooth. It contains two steps: (1) Feature extraction. We use the CLIP encoder and DINOv2 encoder to extract prompt and image tokens, respectively. We use our proposed Grounding Module to extract grounding tokens from layout and text entities. (2) Grounded feature integration. We propose a Subject-Grounded Cross-Attention Layer in each transformer block to integrate the subject image tokens, text tokens, and grounding tokens. Note that the model is trained with a single subject per image, but generalizes well to multiple subjects during inference.
  • Figure 3: Grounding Module: Our grounding module takes both the prompt-layout pairs and reference object-layout pairs as input. For the foreground reference object, both the CLIP text token and the DINOv2 image class token are utilized.
  • Figure 4: Subject-Grounded Cross-Attention: Q, K, and V are the visual query, key, and value, respectively, and A is the affinity matrix.
  • Figure 5: Visual comparison with existing methods for the single-subject customization task. Zoom in to see the details.
  • ...and 8 more figures
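
The idea behind Figure 4, restricting where a subject token may be attended, can be illustrated with a small region-masked cross-attention. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names `box_to_mask` and `masked_cross_attention`, the normalized `(x0, y0, x1, y1)` box format, and the choice to mask by setting disallowed affinities to negative infinity are all hypothetical. Here the affinity is computed as A = QKᵀ/√d, and a subject key is visible only to queries whose grid cell lies inside that subject's box.

```python
import math

def box_to_mask(box, height, width):
    """Flattened boolean mask: True where a grid cell's center lies inside the box.

    `box` is (x0, y0, x1, y1) in normalized [0, 1] coordinates (assumed format).
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(height):
        for col in range(width):
            cx = (col + 0.5) / width
            cy = (row + 0.5) / height
            mask.append(x0 <= cx <= x1 and y0 <= cy <= y1)
    return mask

def masked_cross_attention(queries, keys, values, allowed):
    """Cross-attention where allowed[i][j] says whether query i may attend to key j.

    Returns (outputs, weights); weights[i][j] is exactly 0.0 for disallowed pairs.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q, allow_row in zip(queries, allowed):
        # Affinity row: scaled dot product, or -inf where attention is forbidden.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if ok else -math.inf
            for k, ok in zip(keys, allow_row)
        ]
        finite = [s for s in scores if s != -math.inf]
        if not finite:
            # No key is visible to this query: emit zero weights and zero output.
            row = [0.0] * len(keys)
        else:
            peak = max(finite)  # subtract the max for a numerically stable softmax
            exps = [math.exp(s - peak) if s != -math.inf else 0.0 for s in scores]
            total = sum(exps)
            row = [e / total for e in exps]
        weights.append(row)
        outputs.append(
            [sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))]
        )
    return outputs, weights

# Toy setup: a 2x2 latent grid, one text key (index 0) and one subject key (index 1).
height, width = 2, 2
subject_box = (0.0, 0.0, 0.5, 1.0)  # the subject occupies the left half
region = box_to_mask(subject_box, height, width)

queries = [[1.0, 0.0] for _ in range(height * width)]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

# The text key is visible everywhere; the subject key only inside its box.
allowed = [[True, inside] for inside in region]
outputs, weights = masked_cross_attention(queries, keys, values, allowed)
```

In this toy setup, queries in the right half of the grid put zero weight on the subject key, so the subject's value cannot leak outside its box. With several subjects, each subject key would get its own box mask in the same way.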