Sampling Binary Data by Denoising through Score Functions
Francis Bach, Saeed Saremi
TL;DR
The paper addresses sampling binary data on the Boolean hypercube by replacing Gaussian smoothing with Bernoulli noise smoothing and extending the Tweedie-Miyasawa formula (TMF) denoising framework to binary data. It derives a TMF-like relation that ties the posterior mean, and hence the optimal denoiser, to the score of the noisy distribution, $\mathbb{E}[x|y]=\frac{1}{\alpha}\nabla\log q_\alpha(y)$, and learns these score functions via logistic-denoiser objectives. It then develops one-stage and two-stage discrete Langevin samplers, provides contraction and stationary-distribution guarantees, and extends the approach to a multi-measurement setting in which $m$ noisy measurements at a fixed noise level $\alpha$ give an effective noise parameter of $m\alpha$. The method is validated empirically on synthetic mixtures and binarized MNIST. The approach yields efficient sampling in high-noise regimes, preserves the binary nature of the data, and comes with analytic guarantees. Potential extensions include other exponential-family noise models, sharper priors, and faster sampling via Metropolis-Hastings refinements.
Abstract
Gaussian smoothing and a probabilistic framework for denoising via the empirical Bayes formalism, i.e., the Tweedie-Miyasawa formula (TMF), are the two key ingredients in the success of score-based generative models in Euclidean spaces. Smoothing is key to easing the problem of learning and sampling in high dimensions, denoising is needed to recover the original signal, and TMF ties the two together via the score function of noisy data. In this work, we extend this paradigm to the problem of learning and sampling the distribution of binary data on the Boolean hypercube by adopting Bernoulli noise, instead of Gaussian noise, as a smoothing device. We first derive a TMF-like expression for the optimal denoiser for the Hamming loss, where a score function naturally appears. Sampling noisy binary data is then achieved using a Langevin-like sampler, which we theoretically analyze for different noise levels. At high Bernoulli noise levels sampling becomes easy, akin to log-concave sampling in Euclidean spaces. In addition, we extend the sequential multi-measurement sampling of Saremi et al. (2024) to the binary setting, where we can bring the "effective noise" down by sampling multiple noisy measurements at a fixed noise level, without the need for continuous-time stochastic processes. We validate our formalism and theoretical findings by experiments on synthetic data and binarized images.
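The TMF-like relation quoted in the TL;DR, $\mathbb{E}[x|y]=\frac{1}{\alpha}\nabla\log q_\alpha(y)$, can be checked numerically on a small hypercube by brute-force enumeration. The sketch below is illustrative only and rests on assumptions not stated in this section: data are encoded in $\{-1,+1\}^d$, and the Bernoulli noise is parametrized as $q(y|x)\propto\exp(\alpha\, x^\top y)$ (independent bit flips with probability $1/(1+e^{2\alpha})$), which is one parametrization consistent with the relation above. The function names and the random prior `p` are hypothetical. The score is obtained by finite differences of $\log q_\alpha$ extended to real-valued $y$, and the Hamming-loss denoiser is taken as the sign of the posterior mean (the bitwise posterior mode).

```python
import itertools
import numpy as np

def hypercube(d):
    """All 2^d points of {-1, +1}^d, one per row."""
    return np.array(list(itertools.product([-1.0, 1.0], repeat=d)))

def log_q(y, xs, p, alpha):
    """log q_alpha(y) for real-valued y, under the ASSUMED noise model
    q(y|x) = exp(alpha * x.y) / (2 cosh(alpha))^d."""
    d = xs.shape[1]
    a = alpha * (xs @ y) + np.log(p)
    m = a.max()  # log-sum-exp for numerical stability
    return m + np.log(np.exp(a - m).sum()) - d * np.log(2.0 * np.cosh(alpha))

def posterior_mean(y, xs, p, alpha):
    """E[x | y] by exact enumeration over the hypercube."""
    a = alpha * (xs @ y) + np.log(p)
    w = np.exp(a - a.max())
    w /= w.sum()
    return w @ xs

def numerical_grad(f, y, eps=1e-5):
    """Central finite-difference gradient of f at y."""
    g = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = eps
        g[i] = (f(y + e) - f(y - e)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
d, alpha = 3, 0.7
xs = hypercube(d)
p = rng.dirichlet(np.ones(len(xs)))  # arbitrary prior on the hypercube (hypothetical)
y = xs[5].copy()                     # a noisy observation, itself a hypercube point

score = numerical_grad(lambda v: log_q(v, xs, p, alpha), y)
denoised = posterior_mean(y, xs, p, alpha)

# Gap between E[x|y] and (1/alpha) * grad log q_alpha(y); should be ~0.
max_gap = np.max(np.abs(denoised - score / alpha))

# Bitwise posterior mode, which minimizes the expected Hamming loss.
hamming_denoiser = np.sign(denoised)
print(max_gap, hamming_denoiser)
```

Enumeration costs $2^d$ and is only meant as a sanity check in very low dimension; in the setting described above, the score would instead be learned from data.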
