3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang

TL;DR

3DIS tackles controllable multi-instance generation by decoupling layout from fine-grained rendering, enabling compatibility with a wide range of foundational diffusion models. It introduces a layout-to-depth adapter integrated into LDM3D to generate scene depth maps for precise instance placement, and a training-free detail renderer that uses pre-trained ControlNet to render per-instance attributes guided by the depth map. A low-pass filtering strategy on ControlNet features and SAM-assisted instance localization improve coherence and reduce attribute leakage. Evaluations on COCO-Position and COCO-MIG show substantial gains in layout accuracy (AP, AP75, MIoU) and attribute rendering (IASR) compared to training-free and adapter-based baselines, and demonstrate universal rendering across SD2/SDXL. The approach offers a scalable pathway to leverage diverse foundation models for high-quality, controllable multi-instance generation.
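
To make the low-pass filtering idea concrete, here is a minimal sketch of a frequency-domain low-pass filter applied to a feature map. It is an illustration only: the function name `low_pass_filter`, the radial `cutoff` parameter, and the use of a hard FFT mask are assumptions, not the filter the authors apply to ControlNet features.

```python
import numpy as np


def low_pass_filter(feature: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Suppress spatial frequencies above `cutoff` (cycles per sample).

    `feature` has shape (..., H, W); filtering is applied over the last two axes.
    This is a hypothetical helper, not code from the 3DIS repository.
    """
    h, w = feature.shape[-2:]
    # Frequency coordinates of each FFT bin, in cycles per sample (-0.5 to 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Hard radial mask: keep only bins whose frequency magnitude is below the cutoff.
    mask = np.sqrt(fy ** 2 + fx ** 2) <= cutoff
    spectrum = np.fft.fft2(feature, axes=(-2, -1))
    # The input is real and the mask is symmetric, so the imaginary part is numerical noise.
    return np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real


if __name__ == "__main__":
    yy, xx = np.indices((8, 8))
    checker = (-1.0) ** (yy + xx)           # highest-frequency pattern on the grid
    filtered = low_pass_filter(3.0 + checker)
    print(np.round(filtered, 6))            # the flat component survives, the checkerboard does not
```

The intuition is that coarse structure (where instances sit, overall scene geometry) lives in low frequencies, while fine texture lives in high frequencies; keeping only the former leaves the detail renderer free to decide the latter.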

Abstract

The increasing demand for controllable outputs in text-to-image generation has spurred advancements in multi-instance generation (MIG), allowing users to define both instance layouts and attributes. However, unlike image-conditional generation methods such as ControlNet, MIG techniques have not been widely adopted in state-of-the-art models like SD2 and SDXL, primarily due to the challenge of building robust renderers that simultaneously handle instance positioning and attribute rendering. In this paper, we introduce Depth-Driven Decoupled Instance Synthesis (3DIS), a novel framework that decouples the MIG process into two stages: (i) generating a coarse scene depth map for accurate instance positioning and scene composition, and (ii) rendering fine-grained attributes using pre-trained ControlNet on any foundational model, without additional training. Our 3DIS framework integrates a custom adapter into LDM3D for precise depth-based layouts and employs a finetuning-free method for enhanced instance-level attribute rendering. Extensive experiments on COCO-Position and COCO-MIG benchmarks demonstrate that 3DIS significantly outperforms existing methods in both layout precision and attribute rendering. Notably, 3DIS offers seamless compatibility with diverse foundational models, providing a robust, adaptable solution for advanced multi-instance generation. The code is available at: https://github.com/limuloo/3DIS.
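
Layout precision is typically scored by comparing the boxes a user requested with the boxes detected in the generated image. The sketch below computes box IoU and its mean over pairs. It assumes boxes in `(x1, y1, x2, y2)` format and a one-to-one pairing of requested and detected boxes; the names `box_iou` and `mean_iou` are hypothetical, and the benchmarks' actual matching protocol is not described in this summary.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(requested: Sequence[Box], detected: Sequence[Box]) -> float:
    """Average IoU over already-paired requested/detected boxes."""
    if len(requested) != len(detected):
        raise ValueError("boxes must be paired one-to-one")
    if not requested:
        return 0.0
    return sum(box_iou(r, d) for r, d in zip(requested, detected)) / len(requested)


if __name__ == "__main__":
    layout = [(0, 0, 2, 2), (4, 4, 6, 6)]
    found = [(0, 0, 2, 2), (8, 8, 9, 9)]   # one perfect hit, one complete miss
    print(mean_iou(layout, found))
```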

Paper Structure

This paper contains 21 sections, 6 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Images generated using our 3DIS. Based on the user-provided layout, 3DIS generates a scene depth map that precisely positions each instance and renders their fine-grained attributes without the need for additional training, using a variety of foundational models.
  • Figure 2: The overview of 3DIS. 3DIS decouples image generation into two stages: creating a scene depth map and rendering high-quality RGB images with various generative models. It first trains a Layout-to-Depth model to generate a scene depth map. Then, it uses a pre-trained ControlNet to inject depth information into various generative models, controlling scene representation. Finally, a training-free detail renderer renders the fine-grained attributes of each instance.
  • Figure 3: Qualitative results on COCO-Position (see the comparison section).
  • Figure 4: Qualitative results on COCO-MIG (see the comparison section).
  • Figure 5: Visualization of the impact of low-pass filtering on ControlNet (see the ablation section).
  • ...and 7 more figures