Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective

Davide Ghio; Yatin Dandi; Florent Krzakala; Lenka Zdeborová

TL;DR

This work analyzes the sampling performance of flow-based, diffusion-based, and autoregressive neural samplers on spin-glass and constraint-satisfaction distributions with known ground truth. By mapping sampling to Bayes-optimal denoising of a tilted or pinned measure, the authors derive phase diagrams that reveal first-order transitions along the denoising path as the principal obstacle to efficient sampling. They show that for random-first-order-transition (RFOT) models, generative samplers can be inefficient in regions where MCMC performs well, while in certain inference problems with a computationally hard phase, diffusion and related methods can outperform traditional samplers. The results offer a principled framework for evaluating when these modern generative approaches succeed or fail, and indicate directions for mitigating the hard phase, such as optimizing interpolants or leveraging over-parameterization. Overall, the paper provides a rigorous, physics-informed lens on the limitations and potential of flow, diffusion, and autoregressive samplers in high-dimensional Bayesian sampling tasks.

Abstract

Recent years have witnessed the development of powerful generative models based on flows, diffusion or autoregressive neural networks, achieving remarkable success in generating data from examples, with applications in a broad range of areas. A theoretical analysis of the performance of these methods, and an understanding of their limitations, remain challenging, however. In this paper, we take a step in this direction by analysing the efficiency of sampling by these methods on a class of problems with a known probability distribution, and comparing it with the sampling performance of more traditional methods such as Markov chain Monte Carlo and Langevin dynamics. We focus on a class of probability distributions widely studied in the statistical physics of disordered systems that relate to spin glasses, statistical inference and constraint satisfaction problems. We leverage the fact that sampling via flow-based, diffusion-based or autoregressive network methods can be equivalently mapped to the analysis of a Bayes-optimal denoising of a modified probability measure. Our findings demonstrate that these methods encounter difficulties in sampling that stem from the presence of a first-order phase transition along the algorithm's denoising path. Our conclusions go both ways: we identify regions of parameters where these methods are unable to sample efficiently, while standard Monte Carlo or Langevin approaches can. We also identify regions where the opposite happens: standard approaches are inefficient while the discussed generative methods work well.
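The mapping between sampling and Bayes-optimal denoising can be illustrated on a one-dimensional toy target. The sketch below is not taken from the paper: it assumes the stochastic-localization form of diffusion sampling, with observation process $y_t = t\,x + B_t$, and a target that is uniform on $\{-1,+1\}$, for which the posterior mean is $\mathbb{E}[x \mid y_t] = \tanh(y_t)$. The function names, step size and time horizon are illustrative choices. In this toy case the denoiser is known exactly and no phase transition occurs along the path, so the sampler should recover both modes with equal weight.

```python
import numpy as np

def bayes_denoiser(y):
    """Posterior mean E[x | y_t] for x uniform on {-1, +1} and y_t = t*x + B_t.

    The posterior is proportional to exp(x * y_t), hence the tanh.
    """
    return np.tanh(y)

def sample_by_denoising(n_samples=4000, t_max=20.0, dt=0.01, seed=0):
    """Euler-Maruyama integration of dy = E[x | y] dt + dB, started at y_0 = 0.

    All numerical parameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    y = np.zeros(n_samples)
    n_steps = int(round(t_max / dt))
    for _ in range(n_steps):
        noise = rng.standard_normal(n_samples)
        y += bayes_denoiser(y) * dt + np.sqrt(dt) * noise
    # At large t the posterior mean is close to a sample from the target.
    return bayes_denoiser(y)

samples = sample_by_denoising()
frac_plus = float(np.mean(samples > 0))
print(f"fraction of +1 samples: {frac_plus:.3f}")
print(f"mean |sample|: {float(np.mean(np.abs(samples))):.4f}")
```

In the models studied in the paper the denoiser is a high-dimensional posterior mean rather than a scalar `tanh`, and the difficulty discussed in the abstract arises when that posterior mean changes discontinuously along the path.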

Paper Structure (40 sections, 7 theorems, 116 equations, 12 figures, 2 algorithms)

Introduction
Further related works
Sampling with flow and diffusion-based models
Diffusion-based models and SDEs
The posterior average and Bayes-optimal denoising
Autoregressive networks and ancestral sampling
Properties of Bayesian-optimal denoising
Prototypical exactly analysable models
Phase diagrams of the tilted and pinning measures
Failure of generative models while MCMC succeeds
Other models
Outperforming MCMC in inference models with a hard phase
Computing free entropies: the planting trick
Asymptotic solutions and Phase Diagrams
Sparse rank-one matrix factorization
...and 25 more sections

Key Result

Theorem 1

Let $X^{(1)}, \ldots , X^{(k)}$ be $k$ i.i.d. samples (given $Y$) from the distribution $P \left( X = \cdot \,\middle|\, Y \right)$. Denoting $\left\langle \cdot \right\rangle$ the "Boltzmann" expectation, that is the average with respect to the $P \left( X = \cdot \,\middle|\, Y \right)$ ...
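The statement above is cut off, but the theorem list names Theorem 1 as the Nishimori identity. In its commonly stated form, the identity says that inside a joint expectation one of the $k$ posterior samples can be replaced by the ground truth $X^*$. The sketch below checks the simplest consequence numerically, $\mathbb{E}[\langle x \rangle^2] = \mathbb{E}[\langle x \rangle X^*]$, on a toy scalar channel that is an assumption of this example and not a model from the paper: $X^*$ uniform on $\{-1,+1\}$ and $Y = \sqrt{\mathrm{snr}}\,X^* + Z$ with standard Gaussian $Z$. The value of `snr` and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000   # number of Monte Carlo draws (illustrative)
snr = 1.5     # hypothetical signal-to-noise ratio

# Ground truth and noisy observation of the toy channel.
x_star = rng.choice([-1.0, 1.0], size=n)
y = np.sqrt(snr) * x_star + rng.standard_normal(n)

# Posterior mean <x> for a uniform +/-1 prior: P(x | y) is proportional
# to exp(sqrt(snr) * x * y), so <x> = tanh(sqrt(snr) * y).
posterior_mean = np.tanh(np.sqrt(snr) * y)

# Two independent posterior samples have overlap <x>^2 on average;
# one posterior sample and the ground truth have overlap <x> * x_star.
lhs = float(np.mean(posterior_mean ** 2))
rhs = float(np.mean(posterior_mean * x_star))
print(f"E[<x>^2]   = {lhs:.4f}")
print(f"E[<x> x*]  = {rhs:.4f}")
```

The two estimates should agree up to Monte Carlo error, which shrinks as `n` grows.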

Figures (12)

Figure 1: Schematic summary of the comparison of the efficiency of sampling with flow-based, diffusion-based and autoregressive methods versus Langevin or Monte Carlo approaches in spin glass models with a random first-order transition (top), and in statistical inference models with a computationally hard phase (bottom). The computationally hard phase in inference problems appears for $\Delta_{\rm alg} < \Delta < \Delta_{\rm IT}$, where efficient algorithms achieving close-to-optimal estimation error are not known and are conjectured not to exist [gamarnik2022disordered].
Figure 2: Phase diagrams for the tilted measure $P_\gamma$ (top) and the pinning measure $P_\theta$ (bottom): (a) the spherical $p$-spin model, (b) the Ising $p$-spin model, (c) the sparse rank-one matrix estimation, (d) the bicoloring problem on hypergraphs (NAE-SAT). The x-axis is the temperature $T=1/\beta$ in (a) and (b), the inverse SNR $\Delta/\rho^2$ in (c), and the clauses-to-variables ratio $\alpha$ in (d), while the y-axis shows the SNR ratio $\gamma^2 = \alpha^2/\beta^2$ (top) and the decimated ratio $\theta$ (bottom). In the green phase the free entropy has a single maximum. The red and orange regions display a phase coexistence with two maxima. In the red region efficient denoising is predicted to be algorithmically hard. In (a), for the spherical $p$-spin at $\gamma=\theta=0$, the dynamical threshold is $T_d=\sqrt{3/8}$ and the Kauzmann transition is $T_{\rm K}\approx 0.58$, while the tri-critical point is at $T_{\rm tri}=2/3$ for diffusion and $T_{\rm tri}=\sqrt{1/2}$ for autoregressive. In (b), for the Ising $p$-spin, the values are $T_d\approx0.682$, $T_{\rm K}\approx 0.652$, $T_{\rm tri}\approx0.741$ for diffusion, and $T_{\rm tri}\approx0.759$ for autoregressive. In (c), for the sparse rank-one matrix estimation at $\rho=0.08$, we have $\Delta_d/\rho^2 \approx 1.041$, $\Delta_{\rm K}/\rho^2 \approx 1.029$, and $\Delta_{\rm alg}/\rho^2 \approx 0.981$. The tri-critical points are at $\Delta_{\rm tri}/\rho^2\approx 1.08$ for diffusion and $\Delta_{\rm tri}/\rho^2\approx 1.069$ for autoregressive. In (d), for bicoloring, the values are $\alpha_d \approx 9.465$ and $\alpha_{\rm K} \approx 10.3$. The tri-critical points are at $\alpha_{\rm tri}\approx 8.4$ for both diffusion and autoregressive. The curves for bicoloring were obtained by a polynomial fit, while in all the other cases the data points are shown directly.
Figure 3: Phase diagrams for flow-based sampling (left) and autoregressive-based sampling (right) for the sparse rank-one model, with Rademacher-Bernoulli prior and sparsity $\rho=0.08$. The x-axis shows the rescaled inverse signal-to-noise ratio $\Delta/\rho^2$ and the y-axis the ratio $\gamma^2 = \alpha^2/\beta^2$ (left) and the decimated ratio $\theta$ (right). We compute the order parameter $\chi/\rho$, defined in (\ref{eq:Chi-app}), both from an uninformed and from an informed initialization, and we plot the difference between the two. The dashed white lines are the spinodal lines, while the dashed black one is the IT threshold. For the flow-based case (left panel), Fig. \ref{fig:CUT-1.05} and Fig. \ref{fig:CUT-0.98} show explicitly the behaviour of the free entropy functional for $\Delta/\rho^2=1.05$ and $\Delta/\rho^2=0.98$. In both plots (left and right) the dynamical transition is at $\Delta_d/\rho^2\approx 1.041$ and the IT/Kauzmann transition is at $\Delta_{\rm IT}/\rho^2 \approx 1.029$, while the tri-critical points are at $\Delta_{\rm tri}/\rho^2\approx1.08$ for flow-based and $\Delta_{\rm tri}/\rho^2\approx1.069$ for autoregressive-based sampling.
Figure 4: The free entropy function $\Phi_{\rm RS}(\chi)$ in the region where the flow-based model fails because of the jump around panel (c).
Figure 5: The free entropy function $\Phi_{\rm RS}(\chi)$ in the region where the flow-based model succeeds, as the maximum is always unique.
...and 7 more figures

Theorems & Definitions (14)

Theorem 1: Nishimori Identity
proof
Lemma 1: First (I-MMSE theorem [guo2005mutual]) and second (FDT theorem [kubo1966fluctuation]) derivatives of the free entropy
proof
Theorem 2: Concentration of overlaps
proof
Lemma 2
proof
Lemma 3
proof
...and 4 more
