Beyond Randomness: Understand the Order of the Noise in Diffusion
Song Yan, Min Li, Bi Xinliang, Jian Yang, Yusen Zhang, Guanye Xiong, Yunwei Lan, Tao Zhang, Wei Zhai, Zheng-Jun Zha
TL;DR
The paper challenges the view that initial diffusion noise is purely random, showing that it carries actionable semantic information. It introduces a training-free two-step pipeline, Semantic Erasure via Noise Normalization followed by Semantic Injection via Temporal Weighting, to refine the noise space. The pipeline is grounded in an equivalence between denoising dynamics and semantic injection and is extended to Conditional Flow Matching. The approach yields improved text-to-image, text-to-video, and text-to-3D alignment across multiple diffusion backbones and benchmarks, supported by qualitative, quantitative, and user studies. This work offers a universal, architecture-agnostic tool for enhancing generation fidelity and semantic coherence in diffusion-based content synthesis.
Abstract
In text-driven content generation (T2C) diffusion models, the semantics of the generated content are mostly attributed to the interaction between text embeddings and the attention mechanism. The initial noise of the generation process is typically characterized as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that beneath the random surface of the noise lie strong, analyzable patterns. Specifically, this paper first conducts a comprehensive analysis of the impact of random noise on the model's generation. We find that noise not only contains rich semantic information, but also allows unwanted semantics to be erased from it in an extremely simple way based on information theory; the equivalence between the generation process of a diffusion model and semantic injection can then be used to inject semantics into the cleaned noise. We then formalize these observations mathematically and propose a simple but efficient, training-free, and universal two-step "Semantic Erasure-Injection" process to modulate the initial noise in T2C diffusion models. Experimental results demonstrate that our method is consistently effective across various T2C models based on both DiT and UNet architectures. It presents a novel perspective for optimizing the generation of diffusion models and provides a universal tool for consistent generation.
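As a rough illustration of the erasure step only, the following is a minimal sketch of one plausible reading of "noise normalization": standardizing each channel of the initial latent noise to zero mean and unit variance. The abstract does not give the exact formulation, so the function name `normalize_noise`, the `(C, H, W)` latent shape, the per-channel statistics, and the artificial channel bias are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def normalize_noise(noise, eps=1e-8):
    """Standardize each channel of a (C, H, W) noise array.

    Assumption: "erasure" is approximated here by removing per-channel
    mean and scale deviations; the paper's actual rule may differ.
    """
    mean = noise.mean(axis=(1, 2), keepdims=True)
    std = noise.std(axis=(1, 2), keepdims=True)
    return (noise - mean) / (std + eps)


# Hypothetical 4-channel latent noise, as used by many latent diffusion models.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 64, 64))

# Add an artificial per-channel offset to mimic a systematic bias in the sample.
bias = np.array([0.3, -0.2, 0.1, 0.0]).reshape(4, 1, 1)
biased = noise + bias

# After normalization, every channel again has zero mean and unit variance.
cleaned = normalize_noise(biased)
```

In this sketch the cleaned array would replace the raw sample as the starting latent; the semantic injection step would then operate on it and is not shown here.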
