Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation

Ahmad Süleyman, Göksel Biricik

TL;DR

This work tackles the challenge of grounding text-to-image diffusion models with precise object descriptions and bounding-box layouts. It introduces ObjectDiffusion, which fuses ControlNet-like conditioning with GLIGEN grounding: a GroundNet processes semantic tokens from CLIP together with Fourier-embedded bounding boxes and is connected to a frozen Stable Diffusion backbone via zero-convolution layers. Trained on the COCO2017 training set and evaluated on the COCO2017 validation set, it achieves AP$_{50}$ = 46.6, AR = 44.5, and FID = 19.8, outperforming prior baselines trained on open-source datasets in both grounding accuracy and image quality. Qualitative analyses show robust grounding across closed-set and open-set vocabularies, while revealing limitations in fine-grained facial details, hands, and text rendering. Overall, the approach improves controllability and fidelity in object-level image generation, enabling more reliable scene composition from open-ended textual descriptions and precise spatial constraints.

Abstract

Text-to-image (T2I) generative diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from text captions. Several layout-to-image models have been developed to control the generation process by utilizing a wide range of layouts, such as segmentation maps, edges, and human keypoints. In this work, we propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information, enabling the precise rendering and placement of desired objects in specific locations defined by bounding boxes. To achieve this, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the grounding method proposed in GLIGEN. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model improves the precision and quality of controllable image generation, achieving an AP$_{\text{50}}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets across all three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding capabilities in closed-set and open-set vocabulary settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple detailed objects in varying sizes, forms, and locations.
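
The zero-convolution connection mentioned in the TL;DR can be illustrated with a minimal NumPy sketch. It assumes the ControlNet-style definition of a zero convolution (a 1x1 convolution whose weights and bias start at zero); the class name `ZeroConv1x1`, the helper `fuse`, and the tensor shapes are hypothetical and are not taken from the paper's code.

```python
import numpy as np


class ZeroConv1x1:
    """1x1 convolution with zero-initialized weights and bias (illustrative only)."""

    def __init__(self, in_channels, out_channels):
        self.weight = np.zeros((out_channels, in_channels))
        self.bias = np.zeros(out_channels)

    def __call__(self, x):
        # x has shape (in_channels, height, width); a 1x1 convolution is a
        # per-pixel linear map over the channel axis.
        return np.einsum("oc,chw->ohw", self.weight, x) + self.bias[:, None, None]


def fuse(backbone_feat, control_feat, zero_conv):
    """Add the trainable branch's features to the frozen backbone's features."""
    return backbone_feat + zero_conv(control_feat)


rng = np.random.default_rng(0)
backbone_feat = rng.normal(size=(4, 8, 8))  # hypothetical frozen-backbone feature map
control_feat = rng.normal(size=(4, 8, 8))   # hypothetical GroundNet feature map
zero_conv = ZeroConv1x1(4, 4)
fused = fuse(backbone_feat, control_feat, zero_conv)
print(np.allclose(fused, backbone_feat))
```

Because the layer outputs zeros at initialization, the fused features equal the frozen backbone's features before any training step, so the trainable branch cannot disturb the pretrained model at the start of fine-tuning.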

Paper Structure (17 sections, 8 equations, 71 figures, 4 tables)


Figures (71)

  • Figure 1: The grounding tokens are formed by fusing the CLIP [radford2021learning] encodings of the object entities with the Fourier [mildenhall10representing] embeddings of their bounding boxes. The concatenated vector is passed through an MLP [popescu2009multilayer], which produces a fixed-size output vector of dimension 768.
  • Figure 2: The ObjectDiffusion architecture is divided into two parallel networks. The first network, depicted in blue, is Stable Diffusion v1.4 [rombach2022high], used as a pretrained text-to-image model. The second network, represented in red, is a trainable GroundNet, which consists of the encoder and middle blocks from GLIGEN [li2023gligen]. GroundNet injects the encoded conditional layout $g$, which consists of the positional and semantic tokens. During training, both networks receive the time step $t$, the caption $c$, and the noised latent image $z_t$. The two networks are connected via zero-convolution layers, highlighted in gray. ObjectDiffusion operates in latent space: the Image Encoder projects the input image from pixel space to latent space, and the Image Decoder projects the output image from latent space back to pixel space. The architecture design is inspired by ControlNet [zhang2023adding].
  • Figure 3: The difference between the resizing via bicubic interpolation [keys1981cubic] that we apply and the center cropping implemented in GLIGEN. The original image is on the left, the image preprocessed as in GLIGEN [li2023gligen] is in the middle, and our preprocessed image is on the right.
  • Figure 4: The inference schema consists of two networks. The first network, displayed in red, is our GroundNet fine-tuned on COCO2017 [lin2014microsoft] object detection annotations. The second network, displayed in blue, is a pretrained GLIGEN [li2023gligen]. We replace the pretrained Stable Diffusion [rombach2022high] model used during training with the pretrained GLIGEN because it yields more precise grounding. GLIGEN is trained on the Objects365 [shao2019objects365], GoldG [li2022grounded] (Flickr and VG), SBU [ordonez2011im2text], and CC3M [sharma2018conceptual] datasets, but not on the COCO dataset.
  • Figure 5: Image Caption: “a young man is on his skateboard doing a trick”. Conditional Entities: person, skateboard
  • ...and 66 more figures
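
The grounding-token construction described in the Figure 1 caption can be sketched as follows. This is a rough NumPy illustration under stated assumptions, not the authors' implementation: the entity embeddings stand in for CLIP text features and are random here, the 768-dimensional input size, the number of Fourier frequencies (8), the temperature (100), the hidden width (512), and the SiLU activation are all assumed values, and the helper names `fourier_embed`, `init_mlp`, and `grounding_tokens` are hypothetical. Only the 768-dimensional output size comes from the caption.

```python
import numpy as np


def fourier_embed(boxes, num_freqs=8, temperature=100.0):
    """Map (N, 4) normalized boxes to (N, 4 * 2 * num_freqs) sin/cos features."""
    boxes = np.asarray(boxes, dtype=np.float64)
    freqs = temperature ** (np.arange(num_freqs) / num_freqs)       # (F,)
    angles = boxes[:, :, None] * freqs[None, None, :]               # (N, 4, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(boxes.shape[0], -1)


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Random placeholder weights for a two-layer MLP (untrained)."""
    return {
        "w1": rng.normal(0.0, 0.02, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.02, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def grounding_tokens(entity_emb, boxes, params):
    """Concatenate entity and box embeddings, then project with the MLP."""
    x = np.concatenate([entity_emb, fourier_embed(boxes)], axis=-1)
    h = x @ params["w1"] + params["b1"]
    h = h / (1.0 + np.exp(-h))  # SiLU activation (an assumption)
    return h @ params["w2"] + params["b2"]


rng = np.random.default_rng(0)
entity_emb = rng.normal(size=(2, 768))          # stand-in for CLIP embeddings of "person", "skateboard"
boxes = np.array([[0.10, 0.20, 0.50, 0.90],     # (x_min, y_min, x_max, y_max), normalized to [0, 1]
                  [0.40, 0.60, 0.80, 0.95]])
params = init_mlp(rng, in_dim=768 + 64, hidden_dim=512, out_dim=768)
tokens = grounding_tokens(entity_emb, boxes, params)
print(tokens.shape)  # (2, 768): one grounding token per object
```

Each object contributes one token, so a layout with N boxes yields N vectors of size 768 that a grounding network can attend to alongside the caption tokens.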