TAUE: Training-free Noise Transplant and Cultivation Diffusion Model

Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi

TL;DR

TAUE addresses the bottleneck of single-layer diffusion outputs by enabling zero-shot, layer-wise image generation without fine-tuning or external datasets. It introduces Noise Transplantation and Cultivation (NTC), which extracts a foreground seedling latent $L_{\text{fg}}$ at $t_{\text{crop}}=\big\lfloor T(1-r_{\text{crop}})\big\rfloor$ and reuses it to steer the composite, then derives a background seedling latent $L_{\text{bg}}$ to generate the background, all while using a cross-attention guided object mask $m_{\text{obj}}$ and a Laplacian high-pass on the transplanted latent. Empirically, TAUE achieves competitive performance with fine-tuned methods and surpasses training-free baselines in layer-wise consistency, while enabling practical capabilities like layout control, multi-object disentanglement, and background replacement in a zero-shot setting. This approach reduces data and compute barriers, broadening access to controllable, modular diffusion-based image synthesis for professional workflows. TAUE thus offers a flexible, efficient path to coherent, multi-layered imagery without costly training or curated datasets.
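To make the two numeric ingredients of the TL;DR concrete, the following is a minimal sketch and not the authors' implementation. It computes the extraction step $t_{\text{crop}}=\lfloor T(1-r_{\text{crop}})\rfloor$ and applies a Laplacian high-pass to one latent channel. The function names `crop_timestep` and `laplacian_high_pass`, the 4-neighbour kernel, and the edge-replication padding are all assumptions made for illustration; the summary above does not specify which Laplacian variant is used.

```python
import math


def crop_timestep(total_steps: int, r_crop: float) -> int:
    """Return t_crop = floor(T * (1 - r_crop)) for a crop ratio in [0, 1]."""
    if not 0.0 <= r_crop <= 1.0:
        raise ValueError("r_crop must lie in [0, 1]")
    return math.floor(total_steps * (1.0 - r_crop))


def laplacian_high_pass(channel):
    """Apply a 4-neighbour Laplacian to a 2D list of floats.

    Assumed kernel (hypothetical choice, not taken from the paper):
        [[ 0, -1,  0],
         [-1,  4, -1],
         [ 0, -1,  0]]
    Borders are handled by replicating the nearest edge value.
    """
    height, width = len(channel), len(channel[0])

    def at(y, x):
        # Clamp indices so out-of-range reads return the nearest edge value.
        y = min(max(y, 0), height - 1)
        x = min(max(x, 0), width - 1)
        return channel[y][x]

    return [
        [
            4 * at(y, x) - at(y - 1, x) - at(y + 1, x) - at(y, x - 1) - at(y, x + 1)
            for x in range(width)
        ]
        for y in range(height)
    ]


if __name__ == "__main__":
    print(crop_timestep(50, 0.5))
    impulse = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    for row in laplacian_high_pass(impulse):
        print(row)
```

A high-pass of this kind removes the constant (low-frequency) part of the latent and keeps edges and fine structure, which is one plausible reading of why it is applied to the transplanted latent.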

Abstract

Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.

Paper Structure

This paper contains 20 sections, 12 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: TAUE introduces a training-free method for layer-wise image generation. By transplanting an intermediate seedling latent from the foreground to the composite generation process, TAUE simultaneously produces a consistent foreground, background, and composite image.
  • Figure 2: The generation process consists of three stages: (1) Foreground Generation, where an object is generated from noise and a seedling latent is extracted; (2) Composite Generation, where the seedling latent is transplanted into a new denoising trajectory to generate a full scene; and (3) Background Generation, where the background is reconstructed separately from the same noise. To ensure spatial and semantic consistency, we introduce an NTC strategy, which constrains the object region with fixed seedling noise during early denoising steps and gradually relaxes this constraint to produce a coherent composite.
  • Figure 3: Illustration of the cross-attention blending mechanism. The foreground prompt is applied to object regions $m_{\text{obj}}$, while the background prompt is applied to non-object regions $1-m_{\text{obj}}$. This enables precise control over foreground-background composition, ensuring cohesive integration of both layers in the final composite scene.
  • Figure 4: Qualitative comparison of layer-wise image generation. For each case, we show the foreground, background, and composite image generated by LayerDiffuse [zhang2024transparent], Alfie [quattrini2024alfie], and our method. TAUE consistently produces spatially aligned and semantically coherent multi-layer outputs, achieving realistic integration of foreground and background without fine-tuning or inpainting.
  • Figure 5: Applications of TAUE. TAUE enables (a) Layout and Size Control by injecting bounding box constraints to specify the position and scale of the foreground object; (b) Disentangled Multi-Object Generation by transplanting seedling noise to multiple spatial locations, allowing compositionally coherent and semantically independent objects; and (c) Background Replacement by regenerating backgrounds while preserving the original foreground structure.
  • ...and 3 more figures
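
The captions of Figures 2 and 3 describe constraining the object region $m_{\text{obj}}$ with the fixed seedling latent during early denoising steps and then gradually relaxing that constraint. The sketch below illustrates one way such a mask-weighted transplant could look. It is an assumption-laden illustration, not the paper's procedure: the linear decay schedule, the blend formula, and the names `relax_weight`, `transplant`, `hold_steps`, and `relax_steps` are all hypothetical.

```python
def relax_weight(step: int, hold_steps: int, relax_steps: int) -> float:
    """Constraint strength at a given denoising step.

    Assumed schedule (hypothetical): full strength (1.0) for the first
    `hold_steps` steps, then a linear decay to 0.0 over `relax_steps` steps.
    """
    if step < hold_steps:
        return 1.0
    if relax_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - (step - hold_steps) / relax_steps)


def transplant(current, seedling, mask, weight):
    """Blend the seedling latent into the current latent inside the object mask.

    Per element: out = (weight * m) * seedling + (1 - weight * m) * current,
    where m is the mask value in [0, 1]. Outside the mask (m = 0) the
    current latent is left unchanged.
    """
    return [
        [
            (weight * m) * s + (1.0 - weight * m) * c
            for c, s, m in zip(current_row, seedling_row, mask_row)
        ]
        for current_row, seedling_row, mask_row in zip(current, seedling, mask)
    ]


if __name__ == "__main__":
    current = [[1.0, 2.0]]
    seedling = [[9.0, 9.0]]
    mask = [[1.0, 0.0]]  # left element is "object", right is "background"
    for step in range(8):
        w = relax_weight(step, hold_steps=2, relax_steps=4)
        print(step, w, transplant(current, seedling, mask, w))
```

With full weight the masked element is replaced by the seedling value, and as the weight decays the model's own latent takes over, which mirrors the "constrain early, relax later" behaviour the caption describes.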