Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
Ashkan Taghipour, Morteza Ghahremani, Mohammed Bennamoun, Aref Miri Rekavandi, Hamid Laga, Farid Boussaid
TL;DR
This work tackles semantic fidelity and precise spatial control in text-to-image (T2I) diffusion models by introducing Box-it-to-Bind-it (B2B), a training-free, plug-and-play module. B2B operates zero-shot in two stages, object generation and attribute binding, both of which rely on a Bayesian objective and reward-guided latent updates. Object generation uses an IoU-based attention reward within the specified bounding boxes, while attribute binding uses a KL-based reward to align attribute distributions with their corresponding objects. B2B intervenes at the cross-attention level of the denoising UNet, at a resolution of $16 \times 16$, and can be added to models such as Stable Diffusion and GLIGEN to improve layout adherence and attribute fidelity. Experiments on CompBench and TIFA show state-of-the-art performance in color and texture binding and in spatial reasoning, and qualitative results and plug-and-play experiments support its compatibility with existing models. The result is a practical, scalable approach to precise T2I generation, relevant to both research and real-world deployment of controlled diffusion models.
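To make the two rewards concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each token's cross-attention map is a flattened $16 \times 16$ grid of non-negative weights, that the IoU-based reward can be approximated by a soft IoU between the peak-scaled attention map and a binary box mask, and that the KL-based reward is the negative KL divergence between the normalized attribute and object maps. The helper names (`box_mask`, `iou_reward`, `kl_reward`) are hypothetical.

```python
import math

GRID = 16  # cross-attention resolution mentioned in the text


def box_mask(box, grid=GRID):
    """Flattened binary mask for a box (x0, y0, x1, y1) in [0, 1] coordinates."""
    x0, y0, x1, y1 = box
    c0, r0 = int(x0 * grid), int(y0 * grid)
    c1, r1 = math.ceil(x1 * grid), math.ceil(y1 * grid)
    return [1.0 if (r0 <= r < r1 and c0 <= c < c1) else 0.0
            for r in range(grid) for c in range(grid)]


def normalize(attn):
    """Scale a non-negative map so that its entries sum to 1."""
    total = sum(attn)
    return [v / total for v in attn]


def iou_reward(attn, mask):
    """Soft IoU between a peak-scaled attention map and a binary mask (assumed form)."""
    peak = max(attn)
    scaled = [v / peak for v in attn]
    inter = sum(a * m for a, m in zip(scaled, mask))
    union = sum(scaled) + sum(mask) - inter
    return inter / union


def kl_reward(attr_attn, obj_attn, eps=1e-8):
    """Negative KL(attribute || object); 0 is the best possible value (assumed form)."""
    p, q = normalize(attr_attn), normalize(obj_attn)
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return -kl


# Usage: attention that exactly fills the box gets the maximum IoU reward,
# and an attribute map identical to the object map gets the maximum KL reward.
mask = box_mask((0.25, 0.25, 0.75, 0.75))
inside = list(mask)
outside = [1.0 - m for m in mask]
uniform = [1.0] * (GRID * GRID)
print(iou_reward(inside, mask), iou_reward(outside, mask))
print(kl_reward(inside, inside), kl_reward(uniform, inside))
```

Under these assumptions, both rewards are highest when the attention mass sits where the layout and the prompt say it should, which is the quantity that the reward-guided latent updates would try to increase.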
Abstract
While latent diffusion models (LDMs) excel at creating imaginative images, they often lack precision in semantic fidelity and in spatial control over where objects are generated. To address these deficiencies, we introduce the Box-it-to-Bind-it (B2B) module, a novel, training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models. B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance. The process comprises two main steps: i) object generation, which adjusts the latent encoding to guarantee that each object is generated and to place it within its specified bounding box, and ii) attribute binding, which guarantees that generated objects adhere to the attributes specified for them in the prompt. B2B is designed as a compatible plug-and-play module for existing T2I models, markedly improving their performance on these challenges. We evaluate our technique using the established CompBench and TIFA score benchmarks, demonstrating significant performance improvements compared to existing methods. The source code will be made publicly available at https://github.com/nextaistudio/BoxIt2BindIt.
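The idea of adjusting the latent so that an object appears inside its box can be illustrated with a toy gradient-ascent loop. This is a sketch under strong assumptions, not the B2B procedure: the "latent" here is a vector of 256 logits, the "attention map" is simply its softmax (standing in for the UNet's cross-attention), and the reward is the attention mass inside the box rather than the paper's objective. The function names and the step size are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mass_in_box(probs, mask):
    """Toy reward: fraction of attention mass that falls inside the box mask."""
    return sum(p * m for p, m in zip(probs, mask))


def reward_guided_step(logits, mask, step_size=50.0):
    """One gradient-ascent step on the toy reward.

    For R = sum_i mask_i * softmax(z)_i, the gradient is dR/dz_j = p_j * (mask_j - R).
    """
    probs = softmax(logits)
    reward = mass_in_box(probs, mask)
    grad = [p * (m - reward) for p, m in zip(probs, mask)]
    return [z + step_size * g for z, g in zip(logits, grad)]


# Usage: a 16 x 16 grid with the box covering the central 8 x 8 cells.
grid = 16
mask = [1.0 if (4 <= r < 12 and 4 <= c < 12) else 0.0
        for r in range(grid) for c in range(grid)]
latent = [0.0] * (grid * grid)  # uniform attention to start with
before = mass_in_box(softmax(latent), mask)
for _ in range(100):
    latent = reward_guided_step(latent, mask)
after = mass_in_box(softmax(latent), mask)
print(before, after)  # the mass inside the box grows from its initial value of 0.25
```

In a real diffusion model the gradient would have to flow from the attention maps back through the UNet to the latent, typically via automatic differentiation; the closed-form gradient above exists only because the toy map is a plain softmax.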
