Generating from Discrete Distributions Using Diffusions: Insights from Random Constraint Satisfaction Problems

Alankrita Bhatt; Mukur Gupta; Germain Kolossov; Andrea Montanari

Abstract

Generating data from discrete distributions is important for a number of application domains, including text, tabular data, and genomic data. Several groups have recently used random $k$-satisfiability ($k$-SAT) as a synthetic benchmark for new generative techniques. In this paper, we show that fundamental insights from the theory of random constraint satisfaction problems have observable implications (sometimes contradicting intuition) for the behavior of generative techniques on such benchmarks. More precisely, we study the problem of generating a uniformly random solution of a given (random) $k$-SAT or $k$-XORSAT formula. Among other findings, we observe that: $(i)$~Continuous diffusions outperform masked discrete diffusions; $(ii)$~Learned diffusions can match the theoretical `ideal' accuracy; $(iii)$~Smart ordering of the variables can significantly improve accuracy, although not by following popular heuristics.
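
To make the benchmark concrete, the following is a minimal sketch of drawing a random $k$-SAT formula and checking whether a candidate assignment is a solution. It assumes the standard random ensemble (each clause picks $k$ distinct variables uniformly at random, each negated independently with probability $1/2$, and the number of clauses is $\alpha N$ rounded to an integer). The function names `random_ksat` and `is_solution` are hypothetical and are not taken from the paper.

```python
import random


def random_ksat(n, alpha, k, seed=0):
    """Draw a random k-SAT formula with n variables and round(alpha * n) clauses.

    Assumed ensemble: k distinct variables per clause, chosen uniformly,
    each literal negated independently with probability 1/2.
    """
    rng = random.Random(seed)
    m = round(alpha * n)
    formula = []
    for _ in range(m):
        variables = rng.sample(range(n), k)  # k distinct variable indices
        clause = [(v, rng.random() < 0.5) for v in variables]  # (index, negated?)
        formula.append(clause)
    return formula


def is_solution(assignment, formula):
    """Return True if every clause contains at least one satisfied literal."""
    return all(
        any(assignment[v] != negated for v, negated in clause)
        for clause in formula
    )
```

A generated sample counts as a success only if `is_solution` returns `True` for it; uniformity over the set of solutions is a separate question that this check does not address.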

Paper Structure (46 sections, 2 theorems, 53 equations, 15 figures, 1 table, 6 algorithms)

Introduction
Setting and background
Main results
Continuous outperform discrete diffusions; NN matches BP
Ideal target accuracy from random CSP theory
Accuracy is improved by intelligent ordering (continuous and discrete)
Approximate uniformity test
Discussion
Model architecture
Architecture for $r = 1$
Architecture for $r \ge 2$
Architecture for $k$-XORSAT
Experimental setup
Denoiser training
Dataset generation
...and 31 more sections

Key Result

Theorem 1

For any $r\in{\mathbb N}$, $\varepsilon>0$, $\omega\in (0,1)$, let $\hat{m}_{{\rm BP}_{\hbox{\rm\footnotesize c}}(r)}({\boldsymbol Y}_{{\sf B}_G(i,r)})$ be the ${\rm BP}_{\hbox{\rm\footnotesize c}}(r)$ estimate of the conditional expectation at node $i\in V$, under the tilted distribution of Eq. (eq:Tilt). Here the expectation is with respect to $G,{\boldsymbol x},{\boldsymbol y}$ as well as $I\sim{\sf Unif}$ ...

Figures (15)

Figure 1: 4-SAT ($N=100$): learned NN denoiser (solid lines) vs. BP denoiser (dashed lines). Success rate (probability of generating actual solutions) as a function of clause density $\alpha$ for continuous (blue) and discrete (red) diffusions. Each panel corresponds to a different locality radius for the denoisers $r \in \{2,3,6,9\}$. Success rates are computed over 500 random formulas. The vertical line marks the dynamical phase transition $\alpha_{\hbox{\tiny\rm d}}(k=4) \approx 9.38$.
Figure 2: 4-XORSAT ($N=100$): learned NN denoiser (solid lines) vs. BP denoiser (dashed lines). Success rate as a function of clause density $\alpha$ for continuous (blue) and discrete (red) diffusions. Each panel corresponds to a different locality radius for the denoisers $r \in \{2,3,6,9\}$. Success rates are computed over 500 random formulas. The vertical line marks the dynamical phase transition at $\alpha_{\hbox{\tiny\rm d}}(k=4)\approx 0.77228$.
Figure 3: 4-SAT ($N=300$): effect of BP initialization in discrete diffusion. Success rate as a function of clause density $\alpha$ for discrete BP diffusion on random 4-SAT instances with $r=1$ (left) and $r=2$ (right). We compare zero, warm-start, and cavity-based message initialization over 500 random formulas per value of $\alpha$.
Figure 4: $k$-XORSAT ($N=300$): reversed-leaf (red) vs. reversed-degree (black) vs. random (blue) decoding ordering. Success rate as a function of clause density $\alpha$ for $k\in\{4,\dots,8\}$ obtained using discrete diffusion with BP denoiser with $r=300$. The two vertical lines are the theoretical thresholds for random ($\alpha_{\hbox{\tiny\rm mask}}$) and optimal ($\alpha_{\hbox{\tiny\rm d}}$) decoding ordering at every value of $k$. Success rates are computed over 500 random formulas.
Figure 5: 4-XORSAT ($N=100$): reversed-leaf (red) vs. random (blue) decoding ordering. Success rate as a function of clause density $\alpha$ obtained using discrete diffusion with learned NN denoiser. Each panel corresponds to a different denoiser locality radius $r \in \{2,3\}$. Success rates are computed over 500 random formulas. The two vertical lines are the theoretical thresholds for random ($\alpha_{\hbox{\tiny\rm mask}}$) and optimal ($\alpha_{\hbox{\tiny\rm d}}$) decoding ordering.
...and 10 more figures
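
The success rate reported in these captions (the probability of generating an actual solution, estimated over random formulas) can be sketched as follows for $k$-XORSAT. This is an illustrative sketch under stated assumptions: each clause is a parity constraint on $k$ distinct uniformly chosen variables with a uniformly random right-hand side, and `sampler` is a hypothetical stand-in for any generative procedure (diffusion or otherwise) that maps a formula to a candidate assignment. None of the names below come from the paper.

```python
import random


def random_kxorsat(n, alpha, k, seed=0):
    """Draw a random k-XORSAT formula as a list of (variables, parity) pairs.

    Assumed ensemble: round(alpha * n) clauses, each on k distinct
    uniformly chosen variables, with a uniformly random parity bit.
    """
    rng = random.Random(seed)
    m = round(alpha * n)
    return [(rng.sample(range(n), k), rng.randrange(2)) for _ in range(m)]


def satisfies(x, formula):
    """Return True if x (a list of 0/1 values) satisfies every parity constraint."""
    return all(sum(x[v] for v in variables) % 2 == parity
               for variables, parity in formula)


def success_rate(sampler, formulas):
    """Fraction of formulas for which sampler(formula) is an actual solution."""
    hits = sum(satisfies(sampler(f), f) for f in formulas)
    return hits / len(formulas)
```

For example, calling `success_rate` on 500 formulas drawn with different seeds at a fixed $\alpha$ would give one point on a success-rate curve, given some concrete `sampler`.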

Theorems & Definitions (2)

Theorem 1
Theorem 2
