Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

Ruidong Chen, Yancheng Bai, Xuanpu Zhang, Jianhao Zeng, Lanjun Wang, Dan Song, Lei Sun, Xiangxiang Chu, Anan Liu

TL;DR

LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers, and natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders.

Abstract

Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind's effectiveness, highlighting its strong potential for creative applications.
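
The fusion step described above (per-instance branches merged "according to the layer order") can be pictured with a small sketch. This is an illustrative toy only, not the authors' implementation: it assumes hard binary region masks, a single scalar per latent cell, and a back-to-front ordering in which later layers occlude earlier ones. The name `fuse_layers` and all variable names are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[float]]
Mask = List[List[int]]


def fuse_layers(background: Grid, instances: Sequence[Grid], masks: Sequence[Mask]) -> Grid:
    """Composite per-instance latents over a background, back to front.

    Assumption (not from the paper): instances[0] is the rearmost layer,
    and each later instance overwrites earlier ones inside its own mask.
    """
    if len(instances) != len(masks):
        raise ValueError("each instance latent needs exactly one region mask")

    # Start from a copy of the shared background latent.
    fused = [row[:] for row in background]

    # Paste each instance into its region; later layers occlude earlier ones.
    for latent, mask in zip(instances, masks):
        for y, mask_row in enumerate(mask):
            for x, inside in enumerate(mask_row):
                if inside:
                    fused[y][x] = latent[y][x]
    return fused


# Toy 2x3 latents: instance A covers columns 0-1, instance B covers columns 1-2.
background = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
latent_a = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
latent_b = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
mask_a = [[1, 1, 0], [1, 1, 0]]
mask_b = [[0, 1, 1], [0, 1, 1]]

b_in_front = fuse_layers(background, [latent_a, latent_b], [mask_a, mask_b])
a_in_front = fuse_layers(background, [latent_b, latent_a], [mask_b, mask_a])
```

Swapping the order of the two layers changes only the overlapping column, which mirrors the idea that rearranging the early latent structure determines which instance ends up visible.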

Paper Structure

This paper contains 35 sections, 14 equations, 22 figures, and 4 tables.

Figures (22)

  • Figure 1: We propose LayerBind, a training-free strategy to empower text-to-image DiT models [bfl_flux1_dev_modelcard_2025, stability_sd3_5_large_modelcard_2025] with regional and occlusion controllability. (Top) Compared to prior methods [zhan2025larender, zhang2025creatilayout], LayerBind produces customized images that respect the specified spatial layout and occlusion relations without image quality degradation. (Bottom) LayerBind is based on a context-sharing, region-branching strategy. This design inherently enables editable generation, allowing flexible modifications like changing per-region instances or visible orders.
  • Figure 2: (a, b) Observation: simply rearranging the latent structure at an early step directly manipulates the final spatial layout and occlusion order. (c) Our LayerBind scheme: initializing the instance layout first, then conducting semantic nursing for instance detail while maintaining layout and occlusions.
  • Figure 3: Overview of the LayerBind pipeline. (a) Layer-wise Instance Initialization splits early denoising into background and instance branches. Each instance is generated independently while sharing background context (via Contextual Attention, CTA, Eq. \ref{eq:contextual_update}); the branches are then fused to form the initialized early latent. (b) Layer-wise Semantic Nursing reinforces the subsequent generation. It conducts layer-wise sequential CTA updates for each region, modulated by a Layer Transparency Scheduler, to refine instance details and maintain occlusions. Note: For simplicity, only image token updates are visualized; the detailed strategy is described in the following sections.
  • Figure 4: Attention response weights of foreground to background and text across different FLUX [bfl_flux1_dev_modelcard_2025] layers. We select layer 0 [wei2025freeflux, avrahami2025stable] and layers with strong text response for hard instance binding. Further analysis is presented in Appendix \ref{sup:method}.
  • Figure 5: Visualization of occlusion control abilities. Compared to previous methods, LayerBind achieves more precise layer-wise control, avoiding errors such as instance neglect and concept blending. More visualizations are available in Appendix \ref{sup:vis}.
  • ...and 17 more figures