Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models

Ashkan Taghipour; Morteza Ghahremani; Mohammed Bennamoun; Aref Miri Rekavandi; Hamid Laga; Farid Boussaid

TL;DR

This work addresses semantic fidelity and spatial control in text-to-image (T2I) diffusion models by introducing Box-it-to-Bind-it (B2B), a training-free, plug-and-play module. B2B works zero-shot in two stages, object generation and attribute binding, both driven by a Bayesian objective and reward-guided latent updates. Object generation uses an IoU-based reward on the attention inside each bounding box; attribute binding uses a KL-based reward that aligns each attribute's attention distribution with that of its corresponding object. B2B intervenes at the $16 \times 16$ cross-attention layer of the denoising UNet and can be added to models such as Stable Diffusion and GLIGEN to improve layout adherence and attribute fidelity. On CompBench and TIFA, the paper reports state-of-the-art results in color binding, texture binding, and spatial reasoning, and qualitative and plug-and-play experiments support its compatibility with existing models. The result is a practical approach to controllable T2I generation that requires no retraining.

Abstract

While latent diffusion models (LDMs) excel at creating imaginative images, they often lack precision in semantic fidelity and spatial control over where objects are generated. To address these deficiencies, we introduce the Box-it-to-Bind-it (B2B) module, a novel, training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models. B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance. The process comprises two main steps: i) object generation, which adjusts the latent encoding to guarantee that each object is generated and to place it within its specified bounding box, and ii) attribute binding, which guarantees that generated objects adhere to the attributes specified for them in the prompt. B2B is designed as a compatible plug-and-play module for existing T2I models, markedly improving their performance on these challenges. We evaluate our technique on the established CompBench and TIFA score benchmarks, demonstrating significant performance improvements over existing methods. The source code will be made publicly available at https://github.com/nextaistudio/BoxIt2BindIt.
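
To make the object generation step more concrete, the following is a minimal sketch of one way an IoU-style reward between a cross-attention map and a bounding box could be scored. It is not the paper's formulation, which is not given here: the function name `soft_iou_reward`, the soft-IoU definition, the box convention, and the toy $4 \times 4$ maps are all assumptions made for illustration.

```python
def soft_iou_reward(attn, box):
    """Soft IoU between an attention map and a rectangular box mask.

    attn: 2D list of attention weights in [0, 1] (hypothetical input).
    box:  (row0, col0, row1, col1) with half-open cell ranges (assumed convention).
    """
    r0, c0, r1, c1 = box
    # Soft intersection: attention mass that falls inside the box.
    intersection = sum(attn[r][c] for r in range(r0, r1) for c in range(c0, c1))
    total = sum(sum(row) for row in attn)
    area = (r1 - r0) * (c1 - c0)
    # Soft union: attention mass plus box area, minus the overlap.
    union = total + area - intersection
    return intersection / union if union > 0 else 0.0


# Toy maps: one concentrated inside the box, one shifted to the top-left corner.
box = (1, 1, 3, 3)
attn_centered = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
attn_shifted = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

reward_centered = soft_iou_reward(attn_centered, box)  # 1.0
reward_shifted = soft_iou_reward(attn_shifted, box)    # 1/7, about 0.143
```

A reward of this kind is highest when the object token's attention sits inside its box, which is the behavior that a reward-guided latent update would try to increase.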

Paper Structure (12 sections, 10 equations, 8 figures, 7 tables)

Introduction
Related Works
Proposed Method
Object Generation
Attribute Binding
Implementation
Experiments
Experimental Setup
Benchmark Results
Plug-and-Play Analysis
Ablation Study
Conclusion

Figures (8)

Figure 1: The proposed Box-it-to-Bind-it (B2B) is a training-free, plug-and-play tool. It is designed to enhance the performance of latent diffusion models (LDMs) such as Stable Diffusion [rombach2022highresolution] and GLIGEN [li2023gligen]. Its primary function is to improve the generation of objects and then bind their attributes within a specified layout.
Figure 2: The framework of the proposed B2B method. The prompt is first passed to an LLM (here GPT-4), which extracts the object tokens, their respective attributes, and the bounding box coordinates for each object in the text. In the latent space, this information is fed into the $16 \times 16$ cross-attention layer of the denoising UNet at specified timesteps $\mathcal{T}_t$. The generation module ensures that each object in the prompt is generated and placed according to the given layout, while the binding module handles attribute binding.
Figure 3: IoU-based framework for object generation. Because LDMs do not generally position objects within their designated bounding boxes, we force the LDM to generate each object centered within its specified bounding box by adding $N$ additional boxes that push the object away from the borders.
Figure 4: The asymmetric KL divergence pushes each attribute's distribution toward that of its corresponding object in the cross-attention maps. Because the attention maps of the objects have already been enriched during the generation stage, pushing the attribute distributions toward their objects yields meaningful attention weights.
Figure 5: Visual comparison of methods for different scenarios, including color binding, texture binding, and spatial reasoning.
...and 3 more figures
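
The KL-based binding idea in the Figure 4 caption can be illustrated with a short sketch. This is an assumption-laden example rather than the paper's implementation: the direction KL(attribute || object), the `eps` smoothing, the helper names, and the toy $2 \times 2$ attention maps are all hypothetical.

```python
import math


def normalize(attn, eps=1e-8):
    """Flatten a 2D attention map and normalize it into a probability vector."""
    flat = [v + eps for row in attn for v in row]  # eps avoids log(0)
    total = sum(flat)
    return [v / total for v in flat]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def binding_kl(attr_attn, obj_attn):
    """KL from the attribute's attention distribution to the object's.

    The direction used here is an assumption; the caption only says the
    divergence is asymmetric and moves attributes toward their objects.
    """
    return kl_divergence(normalize(attr_attn), normalize(obj_attn))


# Toy maps: the object attends mostly to the top-left cell,
# while the attribute's attention is spread uniformly.
obj_attn = [[0.7, 0.1], [0.1, 0.1]]
attr_attn = [[0.25, 0.25], [0.25, 0.25]]

kl_attr_to_obj = binding_kl(attr_attn, obj_attn)
kl_obj_to_attr = binding_kl(obj_attn, attr_attn)
kl_same = binding_kl(obj_attn, obj_attn)
```

The two directions give different values, which is why the choice of direction matters; a binding reward would then favor latents for which the attribute-to-object divergence is small.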
