TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi
TL;DR
TAUE addresses the bottleneck of single-layer diffusion outputs by enabling zero-shot, layer-wise image generation without fine-tuning or external datasets. It introduces Noise Transplantation and Cultivation (NTC), which extracts a foreground seedling latent $L_{\text{fg}}$ at $t_{\text{crop}}=\big\lfloor T(1-r_{\text{crop}})\big\rfloor$ and reuses it to steer the composite, then derives a background seedling latent $L_{\text{bg}}$ to generate the background, all while using a cross-attention guided object mask $m_{\text{obj}}$ and a Laplacian high-pass on the transplanted latent. Empirically, TAUE achieves competitive performance with fine-tuned methods and surpasses training-free baselines in layer-wise consistency, while enabling practical capabilities like layout control, multi-object disentanglement, and background replacement in a zero-shot setting. This approach reduces data and compute barriers, broadening access to controllable, modular diffusion-based image synthesis for professional workflows. TAUE thus offers a flexible, efficient path to coherent, multi-layered imagery without costly training or curated datasets.
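To make the two numeric ingredients above concrete, here is a minimal Python sketch of the crop timestep $t_{\text{crop}}=\lfloor T(1-r_{\text{crop}})\rfloor$ and of a Laplacian high-pass applied to a latent. The function names `crop_timestep` and `laplacian_high_pass` are hypothetical. The 4-neighbour kernel, the edge padding, and the `(C, H, W)` latent layout are assumptions made for illustration, since the summary does not specify the exact filter.

```python
import math

import numpy as np


def crop_timestep(num_steps: int, r_crop: float) -> int:
    """Return t_crop = floor(T * (1 - r_crop)) for T = num_steps."""
    if not 0.0 <= r_crop <= 1.0:
        raise ValueError("r_crop must lie in [0, 1]")
    return math.floor(num_steps * (1.0 - r_crop))


def laplacian_high_pass(latent: np.ndarray) -> np.ndarray:
    """Apply a 4-neighbour Laplacian to each channel of a (C, H, W) latent.

    Assumption: a plain 3x3 Laplacian with edge padding stands in for
    whatever high-pass filter the method actually uses.
    """
    if latent.ndim != 3:
        raise ValueError("expected a latent of shape (C, H, W)")
    # Replicate border values so flat regions map to exactly zero.
    padded = np.pad(latent, ((0, 0), (1, 1), (1, 1)), mode="edge")
    center = padded[:, 1:-1, 1:-1]
    neighbours = (
        padded[:, :-2, 1:-1]    # pixel above
        + padded[:, 2:, 1:-1]   # pixel below
        + padded[:, 1:-1, :-2]  # pixel to the left
        + padded[:, 1:-1, 2:]   # pixel to the right
    )
    return 4.0 * center - neighbours
```

With `num_steps=50` and `r_crop=0.5`, `crop_timestep` returns 25. A constant latent passes through `laplacian_high_pass` as all zeros, which is the behaviour expected of a high-pass filter: only spatial variation survives.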
Abstract
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
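As a rough illustration of what "transplanting" an intermediate latent into initial noise could look like, the sketch below blends a seedling latent into fresh noise under a spatial mask. The abstract does not state the blending rule, so the linear mask blend, the function name `transplant`, and the tensor shapes are all assumptions and not the authors' procedure.

```python
import numpy as np


def transplant(initial_noise: np.ndarray,
               seedling: np.ndarray,
               mask: np.ndarray) -> np.ndarray:
    """Blend a seedling latent into initial noise under a spatial mask.

    initial_noise, seedling: arrays of shape (C, H, W).
    mask: array of shape (H, W) with values in [0, 1]; 1 keeps the seedling.
    Assumption: a simple linear blend, chosen only for illustration.
    """
    if initial_noise.shape != seedling.shape:
        raise ValueError("noise and seedling must share a shape")
    if mask.shape != initial_noise.shape[1:]:
        raise ValueError("mask must match the spatial size (H, W)")
    m = mask[np.newaxis, :, :]  # broadcast the mask over channels
    return m * seedling + (1.0 - m) * initial_noise


# Hypothetical usage: transplant the left half of a seedling latent.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8, 8))
seedling_latent = rng.standard_normal((4, 8, 8))
object_mask = np.zeros((8, 8))
object_mask[:, :4] = 1.0
blended = transplant(noise, seedling_latent, object_mask)
```

Inside the masked region the result equals the seedling latent, and outside it the original noise is left untouched, so the unmasked area stays free for the sampler to generate.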
