Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

Jonghyun Lee; Hansam Cho; Youngjoon Yoo; Seoung Bum Kim; Yonghyun Jeong

TL;DR

Compose and Conquer (CnC) addresses the limitations of text-only conditioning in diffusion models by enabling 3D depth-aware object placement and region-specific global semantics. It introduces depth disentanglement training (DDT) for relative depth understanding using synthetic image triplets and soft guidance to localize global semantics onto targeted image regions, all integrated via a local fuser and a global fuser atop a frozen Stable Diffusion backbone. The approach demonstrates improved depth ordering, reduced semantic bleeding, and robust reconstruction on real and synthetic datasets, with extensive qualitative and quantitative validation and ablations. The work provides a reproducible, multi-signal conditioning framework that expands controllable diffusion synthesis toward more realistic, depth-aware, and semantically rich images.

Abstract

Addressing the limitations of text as a source of accurate layout representation in text-conditional diffusion models, many works incorporate additional signals to condition certain attributes within a generated image. Although successful, previous works do not account for the specific localization of said attributes extended into the three-dimensional plane. In this context, we present a conditional diffusion model that integrates control over three-dimensional object placement with disentangled representations of global stylistic semantics from multiple exemplar images. Specifically, we first introduce depth disentanglement training to leverage the relative depth of objects as an estimator, allowing the model to identify the absolute positions of unseen objects through the use of synthetic image triplets. We also introduce soft guidance, a method for imposing global semantics onto targeted regions without the use of any additional localization cues. Our integrated framework, Compose and Conquer (CnC), unifies these techniques to localize multiple conditions in a disentangled manner. We demonstrate that our approach allows perception of objects at varying depths while offering a versatile framework for composing localized objects with different global semantics. Code: https://github.com/tomtom1103/compose-and-conquer/

Paper Structure (40 sections, 2 equations, 17 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Conditional Diffusion Models.
Beyond Text Conditions.
Methodology: Compose and Conquer
Generative Prior Utilization
Local Fuser
Synthetic Image Triplets.
Depth Disentanglement Training.
Global Fuser
Soft Guidance.
Training
Experiments
Experimental Setup
Datasets.
...and 25 more sections

Figures (17)

Figure 1: Compose and Conquer is able to localize both local and global conditions in a 3D depth aware manner. For details on the figure, see the Introduction.
Figure 2: Model Architecture. Our model consists of a local fuser, a global fuser, and the cloned encoder/center block $\{E', C'\}$. The input depth maps are fed into the local fuser, producing four latent representations of different spatial resolutions, which are incorporated into $E'$. The CLIP image embeddings are fed into the global fuser, producing 2 extra tokens to be concatenated with the text token embeddings. Masks $M$ are flattened and repeated to produce $M'=\operatorname{concat}(J, \varphi(M), 1-\varphi(M))$, which serves as a source of soft guidance of the cross-attention layers.
Figure 3: Depth disentanglement training. Our model trained with DDT (left) correctly places objects from the foreground depth map (top left) closer than those from the background depth map (bottom left), and fully occludes the background objects when the foreground objects are larger. In contrast, when trained only on the depth maps of $I_s$ (right), our model struggles to disentangle the depth maps, either fusing the objects (samples (a), (c)) or ignoring the foreground object entirely (sample (b)).
Figure 4: Samples compared to other baseline models. Compared to others, CnC strikes a balance between the given depth maps, exemplar images, and text prompts.
Figure 5: Qualitative Results. Foreground/background conditions are on the left of each sample.
...and 12 more figures
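
The soft guidance mask from the Figure 2 caption, $M'=\operatorname{concat}(J, \varphi(M), 1-\varphi(M))$, can be illustrated with a small sketch. This is not the authors' implementation: it assumes that $\varphi$ is a row-major flatten of a binary mask, that $J$ is a block of ones covering the text tokens, that each exemplar contributes one extra token, and that the mask is applied to cross-attention weights by elementwise multiplication. The function names and parameters are hypothetical.

```python
def flatten_mask(mask):
    """Assumed phi(M): flatten an H x W binary mask row by row into H*W floats."""
    return [float(v) for row in mask for v in row]


def build_soft_guidance_mask(mask, n_text_tokens, n_fg_tokens=1, n_bg_tokens=1):
    """Build M' = concat(J, phi(M), 1 - phi(M)) as a list of rows.

    Each row is one spatial position (attention query). Columns are key tokens:
    text tokens first, then foreground tokens, then background tokens.
    Token counts are illustrative assumptions.
    """
    rows = []
    for m in flatten_mask(mask):
        row = [1.0] * n_text_tokens      # J: text tokens stay visible everywhere
        row += [m] * n_fg_tokens         # foreground semantics only inside the mask
        row += [1.0 - m] * n_bg_tokens   # background semantics only outside the mask
        rows.append(row)
    return rows


def apply_soft_guidance(attn, guidance):
    """Elementwise product of attention weights and M' (an assumed application)."""
    return [[a * g for a, g in zip(a_row, g_row)]
            for a_row, g_row in zip(attn, guidance)]


# A 2x2 mask whose top-left pixel belongs to the foreground object.
guidance = build_soft_guidance_mask([[1, 0], [0, 0]], n_text_tokens=3)
```

Under these assumptions, the top-left position keeps its attention to the foreground token and drops the background token, while the other three positions do the opposite, which is how the mask confines each exemplar's global semantics to its own region.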
