Generative diffusion model with inverse renormalization group flows

Kanta Masuki, Yuto Ashida

TL;DR

This work reframes generative diffusion in terms of exact renormalization group (RG) flows, yielding an RG-based diffusion model (RGDM) that generates data coarse-to-fine by reversing RG coarse-graining. The forward process adds scale-dependent colored noise, shaped by a regulator (cutoff function), that progressively erases fine-scale details. The backward process reconstructs high-resolution structure starting from the Gaussian fixed-point distribution p_GS ∝ exp(-½∫(∇φ)²). Because the noise schedule is fixed by RG theory rather than tuned ad hoc for each dataset, the approach reduces hyperparameter tuning and improves sample efficiency. It outperforms conventional diffusion models in protein structure prediction and image generation, demonstrating robust multiscale data modeling with RG-inspired dynamics. The theoretical development, including the Polchinski RG equation and the convex-diffusion flow, provides a principled bridge between RG theory and practical diffusion modeling, with implications for scalable generation across domains.
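The Gaussian fixed point p_GS is diagonal in Fourier space. Up to normalization conventions, each nonzero mode φ_k is an independent Gaussian with ⟨|φ_k|²⟩ = 1/k², so fine-scale (large-k) modes carry little power. The sketch below shows one way to draw such fields on a periodic one-dimensional lattice with NumPy. Several choices here are illustrative assumptions rather than details from the paper: the lattice wavenumbers k = 2πm/n, the orthonormal FFT convention, and dropping the non-normalizable zero mode. The name `sample_gaussian_fixed_point` is hypothetical.

```python
import numpy as np

def sample_gaussian_fixed_point(n, num_samples, seed=0):
    """Draw real fields on a periodic 1D lattice of n sites from
    p_GS(phi) proportional to exp(-1/2 * sum_k k^2 |phi_k|^2), zero mode set to 0.

    Assumed conventions (illustrative, not the paper's): lattice wavenumbers
    k = 2*pi*m/n and an orthonormal FFT."""
    rng = np.random.default_rng(seed)
    k = 2 * np.pi * np.fft.rfftfreq(n)         # k[0] = 0, then positive wavenumbers
    var = np.zeros_like(k)
    var[1:] = 1.0 / k[1:] ** 2                 # <|phi_k|^2> = 1/k^2; zero mode dropped
    # Complex modes: split the variance equally between real and imaginary parts.
    re = rng.standard_normal((num_samples, k.size)) * np.sqrt(var / 2)
    im = rng.standard_normal((num_samples, k.size)) * np.sqrt(var / 2)
    phi_k = re + 1j * im
    if n % 2 == 0:
        # The Nyquist mode of a real field is real, so it carries the full variance.
        phi_k[:, -1] = rng.standard_normal(num_samples) * np.sqrt(var[-1])
    # Orthonormal inverse real FFT back to position space.
    phi = np.fft.irfft(phi_k, n=n, axis=-1, norm="ortho")
    return phi, k

phi, k = sample_gaussian_fixed_point(n=64, num_samples=4)
print(phi.shape)  # (4, 64)
```

Working in Fourier space makes p_GS trivial to sample because the k² weight decouples the modes. In position space, the same distribution would require a correlated (non-diagonal) covariance.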

Abstract

Diffusion models represent a class of generative models that produce data by denoising a sample corrupted by white noise. Despite the success of diffusion models in computer vision, audio synthesis, and point cloud generation, they have so far overlooked the inherent multiscale structure of data and suffer from slow generation because of the many iteration steps required. In physics, the renormalization group offers a fundamental framework for linking different scales and constructing accurate coarse-grained models. Here we introduce a renormalization group-based diffusion model that leverages the multiscale nature of data distributions to realize high-quality data generation. In the spirit of renormalization group procedures, we define a flow equation that progressively erases data information, from fine-scale details to coarse-grained structures. By reversing the renormalization group flows, our model generates high-quality samples in a coarse-to-fine manner. We validate the versatility of the model through applications to protein structure prediction and image generation. Our model consistently outperforms conventional diffusion models across standard evaluation metrics, enhancing sample quality and/or accelerating sampling by an order of magnitude. The proposed method alleviates the need for data-dependent tuning of hyperparameters in generative diffusion models, showing promise for systematically increasing sample efficiency based on the concept of the renormalization group.
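A forward flow that erases fine-scale details first can be illustrated mode by mode in Fourier space. The sketch below uses a simple parameterization chosen for illustration, not taken from the paper:

φ_{t,k} = a_{t,k} φ_{0,k} + s_{t,k} ε_k

Its ingredients are a Gaussian regulator a_{t,k} = exp(-(k/Λ_t)²), a logarithmic scale Λ_t = Λ_UV e^{-t/τ}, and a colored-noise variance s²_{t,k} = (1 − a²_{t,k})/k². With these choices, modes with k ≫ Λ_t are replaced by noise first, and k² s²_{t,k} → 1 as t grows. A field already drawn from p_GS is also left statistically unchanged. The function names `mode_schedule` and `forward_noising` and the default values of `lam_uv` and `tau` are hypothetical.

```python
import numpy as np

def mode_schedule(n, t, lam_uv=4 * np.pi, tau=8.0):
    """Per-mode signal factor a_{t,k} and colored-noise variance s2_{t,k} at RG time t.

    HYPOTHETICAL parameterization: Gaussian regulator K_L(k) = exp(-(k/L)^2)
    and logarithmic scale L_t = lam_uv * exp(-t / tau)."""
    k = 2 * np.pi * np.fft.rfftfreq(n)          # lattice wavenumbers in [0, pi]
    lam_t = lam_uv * np.exp(-t / tau)
    a = np.exp(-(k / lam_t) ** 2)               # ~1 for k << L_t, ~0 for k >> L_t; a[0] = 1
    s2 = np.zeros_like(k)
    s2[1:] = (1.0 - a[1:] ** 2) / k[1:] ** 2    # colored noise; k^2 * s2 -> 1 as a -> 0
    return k, a, s2

def forward_noising(phi0, t, seed=0, **kw):
    """phi_{t,k} = a_{t,k} * phi_{0,k} + sqrt(s2_{t,k}) * eps_k for real 1D fields phi0[..., n]."""
    rng = np.random.default_rng(seed)
    n = phi0.shape[-1]
    k, a, s2 = mode_schedule(n, t, **kw)
    phi0_k = np.fft.rfft(phi0, axis=-1, norm="ortho")
    shape = phi0_k.shape
    # Complex Gaussian noise with E|eps_k|^2 = s2_k (zero for the k = 0 mode).
    eps = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) * np.sqrt(s2 / 2)
    if n % 2 == 0:
        eps[..., -1] = rng.standard_normal(shape[:-1]) * np.sqrt(s2[-1])  # real Nyquist mode
    return np.fft.irfft(a * phi0_k + eps, n=n, axis=-1, norm="ortho")
```

For a batch `phi0` of shape `(batch, n)`, `forward_noising(phi0, t)` returns noised fields of the same shape in which only wavenumbers well below Λ_t still carry data information. Leaving the spatial mean (zero mode) untouched is a further simplifying assumption of this sketch.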

Paper Structure

This paper contains 13 sections, 85 equations, 15 figures, and 5 tables.

Figures (15)

  • Figure 1: Renormalization group-based diffusion model (RGDM) and its application to protein structure prediction and image generation. a, Schematic illustrating the concept of RG theory. A complex microscopic model in the UV is coarse-grained into an effective IR model in a lower-dimensional subspace. b, Power-law decay of the power spectral densities of natural data, which motivates the functional ansatz used to model the data distribution. c, Field representations of natural data. A protein structure and an image can be regarded as vector fields $\vec{\phi}(x)$ on one- and two-dimensional spaces, respectively. d, Overview of the RGDM. Starting from a data distribution $p_{\text{data}}$ at short distances (equivalently, at a UV wavenumber $\Lambda_{\text{UV}}$), the RG process iteratively performs coarse-graining, during which all correlations up to the wavenumber scale $\Lambda$ are retained. As the RG scale $\Lambda$ is lowered from $\Lambda_{\text{UV}}$ to below the IR scale $\Lambda_{\text{IR}}$, the model gradually flows to the Gaussian distribution. By reversing these flows, the RGDM generates a sample in a coarse-to-fine manner; the insets show typical generation processes. e, RGDM architecture for training. The RGDM learns the colored noise $\xi_t$, whose schedule is fixed by RG theory. To make generation efficient, projection layers placed before and after the denoising deep neural network (DNN) remove the integrated-out high-wavenumber modes. In protein structure prediction, the amino-acid sequence information $\mathcal{R}$ is embedded into the DNN. f, Typical protein structure generated by the RGDM compared with the ground truth. The total number of generation steps is $T=98$.
  • Figure 2: Cutoff function, noise schedule, and sampling scheme. a, Cutoff function $K_{\Lambda}(k)$ in the exact RG. $K_\Lambda(k)$ changes continuously from one to zero around $|k|\sim\Lambda$, which eliminates higher-wavenumber fluctuations and ensures the scale-separation property of the RG. b, Noise parameters plotted as functions of the logarithmic RG scale $t$. The noise schedule of the RGDM (left panels) is unambiguously determined by exact RG theory, such that the forward process progressively destroys Fourier modes, from high- to low-wavenumber components, by adding colored noise to a sample. The scaling $k^2\bar{\beta}_{tk}\simeq 1$ at large $t$ ensures that the model eventually converges to the fixed-point Gaussian distribution $\phi\!\sim\!p_{\text{GS}}\propto\exp(-\frac{1}{2}\int_k k^2|\phi_k|^2)$. DDPMs (right panels) use white noise, whose schedule is the same for all Fourier modes and must be fine-tuned heuristically in a data-dependent manner. c, Sampling scheme of the inverse RG flows from $t=T$ to $0$. Owing to the scale-separation property of the RG, the higher-wavenumber components $\phi_{t}^>$ are eliminated from the model and obey the Gaussian distribution $p_{\text{GS}}$. At generation step $t$, one creates $\phi_{t-1}$ by denoising only the lower-wavenumber modes $\phi_t^<$ (blue). The Fourier modes in the wavenumber shell (green) are the newly integrated-out components and are sampled from the Gaussian distribution $p_{\text{GS}}$. The modes $\phi_{t-1}^>$ (red) are not sampled at all, as they have already been eliminated from the model.
  • Figure 3: Protein structure prediction by the RGDM. a, Typical protein structures generated by the RGDM and DDPM. Proteins are randomly chosen from CAMEO targets, which are not included in the training dataset, and each model predicts each structure once. The total number of generation steps $T$ depends on the amino-acid sequence length $N$ and is set to $T\!=\!T_{0}\!+\!\tau\ln(N/N_0)$ so that the RG scale at the final diffusion step scales as $\Lambda_T\propto N^{-1}$. Both models use $T_{0}=80$, $\tau= 8.17$, and $N_0=32$. b, Comparison of sample quality between the RGDM and DDPM. Quality is assessed with standard evaluation metrics, and the histograms are obtained by generating 100 samples for each of the proteins in panel a. A lower RMSD indicates better sample quality, whereas for the other metrics higher values indicate better quality. c, Single-structure prediction accuracy evaluated on the CAMEO targets, comprising 182 proteins. The plots are made by generating one structure per target and computing the median and mean of each metric across the sampled structures. Red stars indicate the reference values attained by AlphaFold2 (AF2).
  • Figure 4: Image generation by the RGDM. Top (bottom) panels show the results for the CIFAR-10 (FFHQ) dataset. CIFAR-10 images are used at resolution $32\times 32$, and FFHQ images are resized to $64\times 64$. a, Distance between the data distribution $p_{\text{data}}$ and an intermediate distribution $p_t$ of the forward diffusion, measured by the Fréchet inception distance (FID) [heusel2017a, parmar2021]. The total number of generation steps is $T=1000$. b, Typical images generated by the RGDM trained unconditionally on the CIFAR-10 and FFHQ datasets with $T=200$. c, Sampling quality of the RGDM and DDPM trained on the CIFAR-10 and FFHQ datasets at different total numbers of generation steps $T$. Each data point is the FID between the training dataset and a sampled dataset of $5\times 10^4$ generated images. Error bars are negligibly small compared with the marker size.
  • Figure S1: The effective model of the first two spins is obtained by decimating (summing over) the third spin in the original model. In the effective model, an effective interaction $J_{\rm eff}$ arises between $\sigma_1$ and $\sigma_2$, mediated by the third spin of the original model.
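The step-count rule in the Figure 3 caption can be checked numerically. This check assumes that the logarithmic RG scale is parameterized as Λ_t = Λ_0 e^{-t/τ}, with the same τ as in the caption; that parameterization is an assumption, not stated in the caption. Under it, T = T_0 + τ ln(N/N_0) gives Λ_T = Λ_0 e^{-T_0/τ} N_0/N, i.e., Λ_T ∝ N^{-1}. The sketch uses the caption's values T_0 = 80, τ = 8.17, and N_0 = 32. The helpers `total_steps` and `final_scale` are hypothetical names. The caption does not say how T is rounded to an integer, so T is left as a float.

```python
import math

# Parameters quoted in the Figure 3 caption.
T0, TAU, N0 = 80, 8.17, 32

def total_steps(n_residues):
    """T = T0 + tau * ln(N / N0); left as a float because the rounding rule is unspecified."""
    return T0 + TAU * math.log(n_residues / N0)

def final_scale(n_residues, lam0=1.0):
    """Lambda_T under the ASSUMED parameterization Lambda_t = lam0 * exp(-t / tau)."""
    return lam0 * math.exp(-total_steps(n_residues) / TAU)

for n in (32, 64, 128, 256):
    # N * Lambda_T stays constant across N, i.e. Lambda_T is proportional to 1/N.
    print(n, round(total_steps(n), 2), n * final_scale(n))
```

Longer sequences therefore receive logarithmically more steps, and the final coarse-graining scale shrinks in proportion to the chain length.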