Beyond Randomness: Understand the Order of the Noise in Diffusion

Song Yan, Min Li, Bi Xinliang, Jian Yang, Yusen Zhang, Guanye Xiong, Yunwei Lan, Tao Zhang, Wei Zhai, Zheng-Jun Zha

TL;DR

The paper challenges the view that initial diffusion noise is purely random, showing that it carries actionable semantic information. It introduces a training-free two-step pipeline, Semantic Erasure via Noise Normalization followed by Semantic Injection via Temporal Weighting, that refines the initial noise. The pipeline is grounded in an equivalence between denoising dynamics and semantic injection, and is extended to Conditional Flow Matching. The approach improves text alignment in image, video, and 3D generation across multiple diffusion backbones and benchmarks, supported by qualitative, quantitative, and user studies. The work offers a universal, architecture-agnostic tool for improving generation fidelity and semantic coherence in diffusion-based content synthesis.

Abstract

In text-driven content generation (T2C) diffusion models, the semantics of generated content are mostly attributed to the interaction between text embeddings and attention mechanisms. The initial noise of the generation process is typically characterized as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that beneath the random surface of noise lie strong, analyzable patterns. Specifically, this paper first conducts a comprehensive analysis of the impact of random noise on the model's generation. We find that noise not only contains rich semantic information, but also allows unwanted semantics to be erased from it in an extremely simple way based on information theory, and that the equivalence between the generation process of a diffusion model and semantic injection can be used to inject semantics into the cleaned noise. We then mathematically decipher these observations and propose a simple but efficient, training-free, and universal two-step "Semantic Erasure-Injection" process to modulate the initial noise in T2C diffusion models. Experimental results demonstrate that our method is consistently effective across various T2C models based on both DiT and UNet architectures. It presents a novel perspective for optimizing diffusion model generation and provides a universal tool for consistent generation.

Paper Structure

This paper contains 92 sections, 66 equations, 19 figures, 7 tables, 1 algorithm.

Figures (19)

  • Figure 1: By deeply investigating the patterns within the seemingly random noise of diffusion models, we design the first training-free and universal initial noise optimization method. Our approach is applicable to any content-generation model built on diffusion architectures, enabling the production of outputs that are better aligned with textual semantics and of noticeably higher quality.
  • Figure 2: Influence of Noise Semantics on Model Generation. To investigate how semantics in the noise affect generation, we design a minimalist diffusion model that exclusively transforms a Gaussian distribution into three elementary distributions. Departing from conventional training paradigms, we deliberately overfit certain random seeds to distinct functional distributions during training, thereby simulating the noise-semantic coupling observed in T2C models (see the appendix for training details). As the left panel shows, when the semantics of the noise match those of the target distribution, the model generates higher-quality samples; a semantic mismatch substantially degrades performance. To address this issue, we try two noise-processing strategies: the first samples multiple mismatched noises, sums them, and re-normalizes the result to a Gaussian distribution (denoted "Erased" in the figure); the second uses the model's own predicted velocity/noise to adjust mismatched noises (denoted "Injected"). Both approaches mitigate the adverse effects of semantic mismatch and improve generation. As the right panel shows, this happens because, after extensive training, the model transforms the isotropic Gaussian distribution into an anisotropic semantic distribution, so noise sampled from it exhibits a semantic preference. The multi-sampling and normalization process identifies an average starting point that remains close to all distributions, while adjusting the noise with the model's predictions shifts it closer to the target manifold. Consequently, during the subsequent standard generation process, the model can perform refined generation along a significantly shortened trajectory.
  • Figure 3: Impact of noise semantics on generation across different models. We use the semantic content of images generated from a specific noise with an empty prompt to represent the semantics inherent in that noise. We then generate outputs with two types of prompts: one consistent and one inconsistent with the noise semantics. As shown in the figure, when the noise semantics align with the prompt, the model generates significantly higher-quality results.
  • Figure 4: Alignment Comparisons of SDXL, FLUX, WAN and TRELLIS. More qualitative results can be found in the appendix.
  • Figure 5: Impact of Sample Size in Noise Semantic Erasure and Injection on Model.
  • ...and 14 more figures
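
The two noise-processing strategies described in the Figure 2 caption ("Erased" and "Injected") can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `erase_semantics` and `inject_semantics`, the `1/sqrt(k)` rescaling, the step size, the number of steps, and the toy velocity predictor are all assumptions made for illustration; the paper's Noise Normalization and Temporal Weighting may differ in detail.

```python
import numpy as np


def erase_semantics(noises):
    """Sum several noise samples and rescale back to unit variance.

    Assumption: the samples are independent draws from N(0, I), so their
    sum has variance k and dividing by sqrt(k) restores N(0, I).
    """
    stacked = np.stack(noises, axis=0)
    k = stacked.shape[0]
    return stacked.sum(axis=0) / np.sqrt(k)


def inject_semantics(noise, predict_velocity, step_size=0.1, num_steps=3):
    """Nudge the noise along a model-predicted direction.

    `predict_velocity` is a hypothetical stand-in for a diffusion or
    flow-matching model call; the update rule and sign convention here
    are illustrative assumptions, not the paper's formula.
    """
    x = noise.copy()
    for _ in range(num_steps):
        x = x + step_size * predict_velocity(x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # "Erased": average out seed-specific preferences over 8 samples.
    samples = [rng.standard_normal(4096) for _ in range(8)]
    erased = erase_semantics(samples)
    print("erased mean/std:", erased.mean(), erased.std())

    # "Injected": a toy predictor that points toward a fixed target.
    target = np.full(4096, 2.0)
    injected = inject_semantics(erased, lambda x: target - x)
    print("distance before:", np.linalg.norm(erased - target))
    print("distance after: ", np.linalg.norm(injected - target))
```

In this toy setting the erased noise stays approximately standard Gaussian, and each injection step moves it a fixed fraction closer to the target, which mirrors the caption's point that the adjusted noise starts nearer the target manifold.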