Edge-preserving noise for diffusion models

Jente Vandersanden; Sascha Holl; Xingchang Huang; Gurprit Singh

TL;DR

This work introduces edge-preserving diffusion, a content-aware generalization of isotropic diffusion, by employing a forward hybrid process that initially preserves edges and gradually transitions to isotropic noise. The forward process uses a transition function $\tau(t)$ with a transition point $t_\Phi$ and a time-varying edge sensitivity $\lambda(t)$ to modulate noise based on image structure, while training optimizes a network to predict non-isotropic noise with a loss $L = \lVert f_\theta(x_t, t) - \sigma_t \epsilon_t \rVert^2$. Backward posteriors and training are adapted to this non-isotropic setting, using tensor variances and a corresponding analytic update, enabling faster convergence and better learning of low-to-mid frequency content. Empirically, the method yields up to 30% improvements in FID and CLIP scores across unconditional and shape-guided tasks, including stroke-based generation, with minimal computational overhead. Overall, edge-preserving diffusion improves structural fidelity and robustness, offering a practical enhancement to diffusion-based generation with strong potential for downstream editing and shape-guided synthesis.
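The hybrid noise idea in the summary above can be illustrated with a small, dependency-free sketch. It is not the paper's exact formulation: the linear ramp used for $\tau(t)$, the Perona-Malik style edge weight $1/\sqrt{1 + (\lVert\nabla x_0\rVert/\lambda)^2}$, the way the two are blended, and all names (`transition`, `gradient_magnitude`, `noise_scale`, `hybrid_noise`) and parameter values are assumptions chosen for illustration. The base noise schedule of the diffusion model is also left out; only the edge-dependent modulation is shown.

```python
import math
import random


def transition(t, t_phi):
    """Assumed tau(t): 0 at t = 0 (edge-preserving), linear ramp to 1 at t_phi (isotropic)."""
    return min(t / t_phi, 1.0)


def gradient_magnitude(img):
    """Per-pixel gradient magnitude via forward differences with a replicated border."""
    h, w = len(img), len(img[0])
    grad = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = img[y][min(x + 1, w - 1)] - img[y][x]
            dy = img[min(y + 1, h - 1)][x] - img[y][x]
            grad[y][x] = math.hypot(dx, dy)
    return grad


def noise_scale(grad, t, t_phi, lam):
    """Noise multiplier in (0, 1]: small on strong edges early on, exactly 1 once t >= t_phi.

    `lam` plays the role of the edge sensitivity lambda(t); the caller may vary it with t.
    The blend below is an assumption, not the paper's variance formula.
    """
    tau = transition(t, t_phi)
    edge_weight = 1.0 / math.sqrt(1.0 + (grad / lam) ** 2)
    return (1.0 - tau) * edge_weight + tau


def hybrid_noise(img, t, t_phi, lam, rng):
    """Draw Gaussian noise and damp it per pixel according to the image's edges."""
    grad = gradient_magnitude(img)
    return [[noise_scale(g, t, t_phi, lam) * rng.gauss(0.0, 1.0) for g in row]
            for row in grad]


# Hypothetical usage: a 4x4 image with a vertical step edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
rng = random.Random(0)
early = hybrid_noise(image, t=0, t_phi=250, lam=0.1, rng=rng)    # edge pixels damped
late = hybrid_noise(image, t=499, t_phi=250, lam=0.1, rng=rng)   # plain isotropic noise
```

Under these assumptions, the scaled noise plays the role of the non-isotropic target $\sigma_t \epsilon_t$ that the network is trained to predict: at $t = 0$ the noise on the step edge is strongly suppressed, while for $t \ge t_\Phi$ every pixel receives ordinary unit-variance noise.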

Abstract

Classical generative diffusion models learn an isotropic Gaussian denoising process, treating all spatial regions uniformly and thus neglecting potentially valuable structural information in the data. Inspired by the long-established work on anisotropic diffusion in image processing, we present a novel edge-preserving diffusion model that generalizes existing isotropic models by considering a hybrid noise scheme. In particular, we introduce an edge-aware noise scheduler that varies between edge-preserving and isotropic Gaussian noise. We show that our model's generative process converges faster to results that more closely match the target distribution. We demonstrate its capability to better learn the low-to-mid frequencies within the dataset, which play a crucial role in representing shapes and structural information. Our edge-preserving diffusion process consistently outperforms state-of-the-art baselines in unconditional image generation. It is also notably more robust for generative tasks guided by a shape-based prior, such as stroke-to-image generation. We present qualitative and quantitative results (FID and CLIP score) showing consistent improvements of up to 30% for both tasks.
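The claim about low-to-mid frequencies can be made concrete by measuring how a signal's energy splits across frequency bands. The sketch below is one possible way to do this on a single scanline, using a direct discrete Fourier transform. It is not the paper's evaluation protocol; the function names (`dft_power`, `band_fractions`) and the band cut-offs are hypothetical.

```python
import cmath
import math


def dft_power(signal):
    """Power |X_k|^2 for bins k = 0 .. N//2, computed with a direct O(N^2) DFT."""
    n = len(signal)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(signal))
        power.append(abs(coeff) ** 2)
    return power


def band_fractions(signal, low_cut, mid_cut):
    """Fractions of non-DC power in bins [1, low_cut), [low_cut, mid_cut), [mid_cut, N//2]."""
    power = dft_power(signal)
    total = sum(power[1:])
    low = sum(power[1:low_cut])
    mid = sum(power[low_cut:mid_cut])
    high = sum(power[mid_cut:])
    return low / total, mid / total, high / total


# Hypothetical usage: a smooth scanline with two cycles over 32 samples.
scanline = [math.sin(2.0 * math.pi * 2 * i / 32) for i in range(32)]
low, mid, high = band_fractions(scanline, low_cut=4, mid_cut=10)
```

For this smooth scanline essentially all non-DC energy lands in the low band. Applying the same measurement to generated and reference images (row by row, or with a 2-D transform) is one way to check whether a model reproduces the coarse, shape-carrying part of the spectrum.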

Paper Structure (28 sections, 28 equations, 15 figures, 6 tables)

Introduction
Related work
Background
Generative diffusion processes.
Denoising probabilistic model.
Edge-preserving filters in image processing.
An edge-preserving generative process
Forward hybrid noise scheme
Time-varying edge sensitivity $\lambda(t)$
Backward process posteriors and training
Experiments
Implementation details
Unconditional image generation
Stroke-guided image generation (SDEdit)
Ablation study
...and 13 more sections

Figures (15)

Figure 1: Left: a classic isotropic diffusion process (top row) is compared to our hybrid edge-aware diffusion process (middle row). We propose a hybrid noise (bottom row) that progressively changes from anisotropic ($t=0$) to isotropic noise ($t=499$). We use our edge-aware noise for training and inference. Right: we compare both noise schemes on the SDEdit framework (Meng et al., 2021) for stroke-based image generation. Our model consistently outperforms DDPM's isotropic scheme, is more robust against visual artifacts, and produces sharper outputs without missing structural details.
Figure 2: We visually compare the impact of our edge-preserving noise on the generative process. In each column, we show predictions $\hat{\mathbf{x}}_0$ at selected time steps. Our method converges significantly faster to a sharper and less noisy image than its isotropic counterpart. This is evident from the earlier emergence (from $t=400$) of structural details such as the pattern on the cat's head, eyes, and whiskers with our approach.
Figure 3: We compare unconditionally generated samples for IHDM, BNDM and DDPM with our model. While qualitative improvements are subtle, ours performs consistently better quantitatively. Corresponding FID scores can be found in the quantitative results table. Additional results are presented in the appendix.
Figure 4: Left: Various diffusion models applied to the SDEdit framework (Meng et al., 2021) are shown. The leftmost column displays the stroke-based guide (via k-means clustering applied to an image), with the other three columns showing the model outputs. Overall, our model shows sharper details with fewer distortions compared to other models, leading to better visual and quantitative performance. The corresponding FID scores are shown in the top right column. Right: Our model also effectively uses human-drawn paintings as shape guides, with particularly precise adherence to details, such as the orange patches on the cat's fur, unlike DDPM (middle column).
Figure 5: Generated unconditional samples for the Human Sketch ($128^2$) dataset (Eitz et al., 2012). All models were trained for the same number of epochs (575). Note that the FID scores are inconsistent with visual quality. The cause is the Inception-v3 backbone, which is designed for continuous image data and gives highly unstable results when applied to high-frequency binary data like hand-drawn sketches.
...and 10 more figures
