Score-Based Generative Models Detect Manifolds
Jakiw Pidstrigach
TL;DR
The paper develops a theoretical framework for score-based generative models (SGMs) by analyzing the forward and reverse SDEs and the effects of approximating both the initial distribution of the reverse SDE and the score term in its drift. It derives conditions under which the generated samples have the same support as the data manifold, which bears directly on the question of memorization versus genuine generalization. A key insight is that the drift approximation must be unbounded to achieve generalization, and that the drift explosion near the terminal time is linked to the manifold structure of the data. The work also provides guidance on choosing the prior used in place of $p_T$ and discusses broader implications for the theoretical understanding of SGMs.
Abstract
Score-based generative models (SGMs) need to approximate the scores $\nabla \log p_t$ of the intermediate distributions as well as the final distribution $p_T$ of the forward process. The theoretical underpinnings of the effects of these approximations are still lacking. We find precise conditions under which SGMs are able to produce samples from an underlying (low-dimensional) data manifold $\mathcal{M}$. This assures us that SGMs are able to generate the "right kind of samples". For example, taking $\mathcal{M}$ to be the subset of images of faces, we find conditions under which the SGM robustly produces an image of a face, even though the relative frequencies of these images might not accurately represent the true data generating distribution. Moreover, this analysis is a first step towards understanding the generalization properties of SGMs: Taking $\mathcal{M}$ to be the set of all training samples, our results provide a precise description of when the SGM memorizes its training data.
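As a rough illustration of why the score $\nabla \log p_t$ is sensitive to manifold structure, the sketch below takes the simplest possible "manifold", a single data point, and evaluates the exact score at a fixed location off that point. It is not taken from the paper: the Ornstein-Uhlenbeck forward process $dX_t = -X_t\,dt + \sqrt{2}\,dW_t$, the helper name `ou_score`, and the chosen evaluation times are all assumptions made for the example. Under that process $p_t = \mathcal{N}(x_0 e^{-t}, (1 - e^{-2t}) I)$, so the score has a closed form and its norm grows without bound as the forward time $t \to 0$.

```python
import numpy as np

def ou_score(x, t, x0):
    """Exact score of p_t for the (assumed) forward SDE dX = -X dt + sqrt(2) dW,
    started from a point mass at x0, so that p_t = N(x0 * exp(-t), (1 - exp(-2t)) I)."""
    mean = x0 * np.exp(-t)
    var = 1.0 - np.exp(-2.0 * t)
    return -(x - mean) / var

# Hypothetical setup: the data "manifold" is the single point x0,
# and x is a fixed location a small distance away from it.
x0 = np.zeros(2)
x = np.array([0.1, 0.0])

# Forward times approaching 0, i.e. the end of the reverse-time sampling run.
times = [1.0, 0.1, 0.01, 0.001]
norms = [float(np.linalg.norm(ou_score(x, t, x0))) for t in times]

for t, n in zip(times, norms):
    print(f"t = {t:<6} |score| = {n:.4f}")
```

The score norm at the off-manifold point increases monotonically as $t$ shrinks, while the score evaluated exactly at $x_0$ stays zero. A bounded approximation of the drift could not reproduce this behavior, which is the intuition behind the unboundedness condition summarized in the TL;DR.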
