Spectral Collapse in Diffusion Inversion

Nicolas Bourriez, Alexandre Verine, Auguste Genovesio

TL;DR

The paper identifies spectral collapse as a key bottleneck in deterministic diffusion inversion when translating spectrally sparse inputs to spectrally dense targets. It analyzes how inversion dynamics and the choice of prediction target ($\boldsymbol{\epsilon}$ vs. $\mathbf{x}_0$) influence latent statistics, showing that naive approaches fail to recover high-frequency textures. To address this, the authors propose Orthogonal Variance Guidance (OVG), which injects high-frequency variance while constraining updates to the null-space of the structural gradient, achieving both texture realism and structural fidelity. Extensive experiments on BBBC021 and Edges2Shoes demonstrate that EDM+OVG yields superior perceptual texture quality and robust structure preservation, expanding the Pareto frontier beyond prior deterministic or stochastic methods. The work provides theoretical insights into the spectral properties of diffusion inversion and a practical, inference-time tool for unpaired image translation across spectrally asymmetric domains.

Abstract

Conditional diffusion inversion provides a powerful framework for unpaired image-to-image translation. However, we demonstrate through an extensive analysis that standard deterministic inversion (e.g., DDIM) fails when the source domain is spectrally sparse compared to the target domain (e.g., super-resolution, sketch-to-image). In these contexts, the latent recovered from the input does not follow the expected isotropic Gaussian distribution. Instead, it exhibits a signal concentrated at lower frequencies, locking target sampling to oversmoothed and texture-poor generations. We term this phenomenon spectral collapse. We observe that stochastic alternatives attempting to restore the noise variance tend to break the semantic link to the input, leading to structural drift. To resolve this structure-texture trade-off, we propose Orthogonal Variance Guidance (OVG), an inference-time method that corrects the ODE dynamics to enforce the theoretical Gaussian noise magnitude within the null-space of the structural gradient. Extensive experiments on microscopy super-resolution (BBBC021) and sketch-to-image (Edges2Shoes) demonstrate that OVG effectively restores photorealistic textures while preserving structural fidelity.
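
To make the notion of a spectrally collapsed latent concrete, the following is a minimal sketch that compares the high-frequency energy of white Gaussian noise with that of a low-pass-filtered noise map. It is an illustration only: the function `high_freq_energy_fraction`, the `cutoff` value, and the low-pass stand-in for an inverted latent are assumptions made here, not the paper's metric or its inversion procedure.

```python
import numpy as np


def radial_frequency(shape):
    """Radial frequency (cycles/pixel) of every 2D FFT coefficient."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)


def high_freq_energy_fraction(x, cutoff=0.25):
    """Fraction of spectral power above `cutoff` (hypothetical score)."""
    x = np.asarray(x, dtype=np.float64)
    power = np.abs(np.fft.fft2(x)) ** 2
    mask = radial_frequency(x.shape) > cutoff
    return power[mask].sum() / power.sum()


rng = np.random.default_rng(0)

# Reference: isotropic Gaussian noise has a flat power spectrum on average.
gaussian = rng.standard_normal((64, 64))

# Assumed stand-in for a collapsed latent: the same noise with every
# coefficient above 0.1 cycles/pixel zeroed out.
spectrum = np.fft.fft2(gaussian)
spectrum[radial_frequency(gaussian.shape) > 0.1] = 0.0
collapsed = np.fft.ifft2(spectrum).real

hf_gauss = high_freq_energy_fraction(gaussian)
hf_collapsed = high_freq_energy_fraction(collapsed)
print(f"Gaussian: {hf_gauss:.3f}  collapsed: {hf_collapsed:.3e}")
```

In this toy setting the Gaussian map keeps most of its power above the cutoff while the filtered map keeps essentially none, which is the qualitative gap the abstract describes between the theoretical prior and the inverted latent.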

Paper Structure (77 sections, 53 equations, 20 figures, 11 tables)

This paper contains 77 sections, 53 equations, 20 figures, 11 tables.

Figures (20)

  • Figure 1: Spectral collapse. Unpaired image-to-image translation via diffusion inversion fails when the source domain is spectrally sparse (e.g., Low Res $\to$ High Res). Inverted noise maps display significantly reduced high-frequency energy compared to the theoretical Gaussian prior and retain low-frequency structural imprints. Such spectrally collapsed noise maps lead to oversmoothed generations in the target domain that lack texture. Our proposed method restores the correct Gaussian spectral profile, enabling the generation of realistic high-frequency textures, while preserving structural fidelity.
  • Figure 2: Qualitative comparison on BBBC021 ($\times 16$ super-resolution) using models trained in latent space. Deterministic baselines (DDIM, DirectInv, Null-Class) suffer from spectral collapse, yielding oversmoothed outputs. Stochastic methods (ReNoise, TABA) and standard EDM exhibit artifacts or structural drift. Our EDM+OVG approach uniquely restores realistic high-frequency texture while maintaining strict structural fidelity to the input.
  • Figure 3: Latent Entropy Analysis. We plot the mean Cumulative Explained Variance over 2000 noise maps, computed via Principal Component Analysis (PCA). In contrast to isotropic Gaussian noise ($\mathbf{x}_T^{\mathrm{Gauss}}$), which uses all dimensions equally, the noise maps ($\mathbf{x}_T^{\times 8}$, $\mathbf{x}_T^{\times 16}$, $\mathbf{x}_T^{\times 32}$) exhibit dimensionality collapse, concentrating variance in a sparse set of components.
  • Figure 4: High-Frequency & Low-Frequency Scores vs. Correlation. We observe a strong positive correlation between $S_{\mathrm{HF}}$ and $S_{\mathrm{DC}}$, confirming that latent decorrelation is a prerequisite for high-frequency synthesis. In contrast, while non-Gaussian latents naturally preserve input structure (high $S_{\mathrm{LF}}$), enforcing Gaussianity via stochastic injection forces a trade-off that degrades structural fidelity.
  • Figure 5: Inversion trajectory corrected with Orthogonal Variance Guidance. At each timestep $t$, we compute gradients for variance ($\mathbf{g}_{\mathrm{HF}}$) and structure ($\mathbf{g}_{\mathrm{LF}}$) and, if they conflict, project them into mutually orthogonal subspaces to guide the ODE trajectory.
  • ...and 15 more figures
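
The projection step described in the Figure 5 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the helper name `project_variance_gradient` is hypothetical, "conflicting" is assumed to mean a negative inner product, and only the variance gradient is projected (onto the orthogonal complement of the structural gradient), following the abstract's "null-space of the structural gradient" wording. The paper's exact rule is not given in this excerpt.

```python
import numpy as np


def project_variance_gradient(g_hf, g_lf):
    """Remove from g_hf its component along g_lf when the two conflict.

    Assumption: a conflict is a negative inner product. The structural
    gradient g_lf is returned unchanged.
    """
    g_hf = np.asarray(g_hf, dtype=np.float64)
    g_lf = np.asarray(g_lf, dtype=np.float64)
    dot = float(np.vdot(g_hf, g_lf))      # inner product over all elements
    norm_sq = float(np.vdot(g_lf, g_lf))
    if dot >= 0.0 or norm_sq == 0.0:
        # No conflict (or no structural signal): leave both untouched.
        return g_hf.copy(), g_lf.copy()
    # Orthogonal projection onto the complement of span{g_lf}.
    g_hf_proj = g_hf - (dot / norm_sq) * g_lf
    return g_hf_proj, g_lf.copy()


# Conflicting pair: the variance gradient pushes against the structural one.
g_hf = np.array([1.0, -1.0])
g_lf = np.array([0.0, 1.0])
hf_proj, lf_out = project_variance_gradient(g_hf, g_lf)
print(hf_proj, float(np.vdot(hf_proj, lf_out)))

# Non-conflicting pair: both gradients are returned as given.
hf_same, _ = project_variance_gradient(np.array([1.0, 1.0]), g_lf)
print(hf_same)
```

After projection the variance update has no component along the structural gradient, so a step along it leaves the structural objective unchanged to first order. That is the intuition behind injecting variance without structural drift.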