Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
Ahmad Süleyman, Göksel Biricik
TL;DR
This work tackles the challenge of grounding text-to-image diffusion models with object descriptions and bounding-box layouts. It introduces ObjectDiffusion, which combines ControlNet-like conditioning with GLIGEN grounding: GroundNet processes semantic tokens from CLIP together with Fourier-embedded bounding boxes and is connected to a frozen Stable Diffusion backbone via zero-convolution layers. Fine-tuned on the COCO2017 training set and evaluated on the COCO2017 validation set, it achieves an AP$_{\text{50}}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets on all three metrics. Qualitative analyses show reliable grounding in both closed-set and open-set vocabulary settings, while highlighting limitations in fine-grained facial details, hands, and text rendering. Overall, the approach improves controllability and fidelity in object-level image generation, supporting scene composition from open-ended textual descriptions and explicit spatial constraints.
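To make the idea of a Fourier-embedded bounding box concrete, here is a minimal sketch in plain Python. It assumes a box given as normalized (x_min, y_min, x_max, y_max) coordinates and a GLIGEN-style sinusoidal embedding with 8 geometric frequency bands; the function name `fourier_embed_box` and the defaults `num_freqs=8` and `temperature=100.0` are illustrative assumptions, not values confirmed by this summary.

```python
import math

def fourier_embed_box(box, num_freqs=8, temperature=100.0):
    """Embed a normalized box (x_min, y_min, x_max, y_max) as sin/cos features.

    Hypothetical helper: the frequency schedule is an assumption modeled on
    common sinusoidal position embeddings, not the paper's exact settings.
    """
    if len(box) != 4 or not all(0.0 <= c <= 1.0 for c in box):
        raise ValueError("box must be four coordinates normalized to [0, 1]")
    # Geometric frequency bands, starting at 1 and growing toward `temperature`.
    freqs = [temperature ** (k / num_freqs) for k in range(num_freqs)]
    features = []
    for f in freqs:
        # For each band: four sine features, then four cosine features.
        features.extend(math.sin(f * c) for c in box)
        features.extend(math.cos(f * c) for c in box)
    return features

emb = fourier_embed_box((0.25, 0.10, 0.75, 0.90))
print(len(emb))  # 4 coords * 8 freqs * 2 (sin, cos) = 64
```

Mapping each coordinate to several frequencies gives the network a richer, smoother signal about position than four raw numbers would; in a grounding model, a vector like this would typically be combined with the text embedding of the object description before being passed on.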
Abstract
Text-to-image (T2I) generative diffusion models have demonstrated strong performance in synthesizing diverse, high-quality visuals from text captions. Several layout-to-image models have been developed to control the generation process using layouts such as segmentation maps, edges, and human keypoints. In this work, we propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information, enabling the precise rendering and placement of desired objects in specific locations defined by bounding boxes. To achieve this, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the grounding method proposed in GLIGEN. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model improves the precision and quality of controllable image generation, achieving an AP$_{\text{50}}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets across all three metrics. ObjectDiffusion synthesizes diverse, high-quality, high-fidelity images that conform to the semantic and spatial control layout. In both qualitative and quantitative evaluations, ObjectDiffusion shows strong grounding capabilities in closed-set and open-set vocabulary settings across a wide variety of contexts. The qualitative assessment confirms the ability of ObjectDiffusion to generate multiple detailed objects in varying sizes, forms, and locations.
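The ControlNet-style design that ObjectDiffusion builds on attaches a trainable branch to a frozen backbone through zero-initialized convolutions, so the branch contributes nothing at the start of fine-tuning and the pretrained behavior is preserved. The toy sketch below illustrates only that property, using plain Python lists in place of tensors; the helper names `zero_conv_1x1` and `fuse` are hypothetical, and the 1x1 convolution is simplified to a per-pixel channel mix rather than the actual ObjectDiffusion layers.

```python
def zero_conv_1x1(features, weight, bias):
    """Apply a 1x1 convolution (per-pixel channel mixing) to channel vectors."""
    return [
        [sum(w * x for w, x in zip(row, pixel)) + b for row, b in zip(weight, bias)]
        for pixel in features
    ]

def fuse(backbone_features, control_features, weight, bias):
    """Add the projected control signal to the frozen backbone features."""
    projected = zero_conv_1x1(control_features, weight, bias)
    return [
        [h + p for h, p in zip(h_pixel, p_pixel)]
        for h_pixel, p_pixel in zip(backbone_features, projected)
    ]

channels = 3
# Zero initialization: every weight and bias of the connecting layer is 0.
zero_weight = [[0.0] * channels for _ in range(channels)]
zero_bias = [0.0] * channels

backbone = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]  # two "pixels", three channels
control = [[9.0, 9.0, 9.0], [-4.0, 2.0, 7.0]]    # arbitrary control-branch output

fused_at_init = fuse(backbone, control, zero_weight, zero_bias)
print(fused_at_init == backbone)  # True: the branch is silent at initialization
```

Once training updates the weights away from zero, the control branch starts to shift the backbone features, which is how grounding information can be introduced gradually without disturbing the pretrained model at the outset.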
