3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering

Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang

TL;DR

The paper tackles the inefficiency of adapter-based MIG methods, which must retrain an adapter whenever a more advanced base model is released. It extends the depth-driven 3DIS framework by integrating FLUX (a Diffusion Transformer) for training-free, high-quality rendering. The pipeline has two stages: depth-map construction from a layout, followed by depth-to-image rendering with attention control. A FLUX Detail Renderer constrains Joint Attention, together with a per-instance text encoding scheme, so that each instance keeps its own attributes. On COCO-MIG, 3DIS-FLUX outperforms the original 3DIS, state-of-the-art adapter-based methods, and training-free methods in ISR while maintaining high image quality, illustrating the approach’s practicality and adaptability to newer diffusion backbones.

Abstract

The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project Page: https://limuloo.github.io/3DIS/.
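The layout-based attention masking mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The token ordering, the `Instance` type, the `build_joint_attention_mask` helper, and the exact masking rules are all assumptions made for this example. In particular, the sketch leaves image-to-image attention unrestricted and omits any global prompt.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    # Hypothetical container: a normalized box (x0, y0, x1, y1) in [0, 1]
    # and the number of text tokens that describe this instance.
    box: Tuple[float, float, float, float]
    num_text_tokens: int


def build_joint_attention_mask(instances: List[Instance], grid_h: int, grid_w: int):
    """Build a boolean mask where mask[q][k] == True lets query q attend to key k.

    Assumed token order: text tokens of instance 0, instance 1, ...,
    followed by image tokens in row-major order.
    """
    # Record the [start, end) text-token span of every instance.
    text_spans = []
    start = 0
    for inst in instances:
        text_spans.append((start, start + inst.num_text_tokens))
        start += inst.num_text_tokens
    num_text = start
    total = num_text + grid_h * grid_w
    mask = [[False] * total for _ in range(total)]

    # Text-to-text: an instance's tokens attend only to that instance's tokens.
    for s, e in text_spans:
        for q in range(s, e):
            for k in range(s, e):
                mask[q][k] = True

    # Image-to-image: left unrestricted in this sketch (an assumption).
    for q in range(num_text, total):
        for k in range(num_text, total):
            mask[q][k] = True

    # Image <-> text: a patch whose center lies inside an instance's box
    # exchanges attention with that instance's text tokens only.
    for row in range(grid_h):
        for col in range(grid_w):
            cx = (col + 0.5) / grid_w
            cy = (row + 0.5) / grid_h
            img_idx = num_text + row * grid_w + col
            for inst, (s, e) in zip(instances, text_spans):
                x0, y0, x1, y1 = inst.box
                if x0 <= cx < x1 and y0 <= cy < y1:
                    for t in range(s, e):
                        mask[img_idx][t] = True
                        mask[t][img_idx] = True
    return mask, text_spans


# Usage: two instances splitting the image into left and right halves.
instances = [
    Instance(box=(0.0, 0.0, 0.5, 1.0), num_text_tokens=2),
    Instance(box=(0.5, 0.0, 1.0, 1.0), num_text_tokens=3),
]
mask, text_spans = build_joint_attention_mask(instances, grid_h=2, grid_w=4)
```

In a real pipeline, a mask of this kind would be passed to the attention computation in place of the default unrestricted mask. How and at which sampling steps it is applied is not specified in the text above.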

Paper Structure

This paper contains 12 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Images generated using our 3DIS-FLUX. Based on the user-provided layout, 3DIS (Zhou et al., 2024) generates a scene depth map that precisely positions each instance, then renders each instance's fine-grained attributes without additional training, using a variety of foundation models. Specifically, 3DIS-FLUX employs the state-of-the-art FLUX model for rendering, which produces superior image quality and offers enhanced control.
  • Figure 2: Overview of 3DIS-FLUX. In line with 3DIS, 3DIS-FLUX decouples image generation into two stages: creating a scene depth map, and training-free rendering of high-quality RGB images using various generative models. 3DIS-FLUX uses the Layout-to-Depth model from 3DIS to generate the scene depth map, then employs the FLUX-depth model to render images from that depth map. During rendering, an Attention Controller ensures that each instance's fine-grained attributes are rendered accurately.
  • Figure 3: Qualitative results on COCO-MIG (see the comparison section).
  • Figure 4: Ablation Study on the FLUX Detail Renderer.
  • Figure 5: Ablation Study on Controlling Text-to-Text Attention in the FLUX Detail Renderer.