Statistical Efficiency of Score Matching: The View from Isoperimetry

Frederic Koehler, Alexander Heckett, Andrej Risteski

TL;DR

This work rigorously analyzes the statistical efficiency of score matching for energy-based models, linking its performance relative to maximum likelihood to geometric-analytic quantities such as the Poincaré, log-Sobolev, and isoperimetric constants. It provides nonasymptotic KL guarantees and asymptotic normality results, showing that in smooth, well-behaved distributions score matching can be nearly as efficient as MLE, while in distributions with strong isoperimetric barriers or sparse cuts it can be substantially less efficient. The authors also establish discrete analogues via Glauber dynamics, pseudolikelihood, and ratio matching, and validate the theory through simulations, including neural-network parameterizations. Overall, the paper offers a unified framework connecting statistical efficiency, functional inequalities, and mixing times, with practical implications for training EBMs and designing annealing strategies.

Abstract

Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method, whereby instead of fitting the likelihood $\log p(x)$ for the training data, we instead fit the score function $\nabla_x \log p(x)$ -- obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it's unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood -- which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated -- i.e. the Poincaré, log-Sobolev and isoperimetric constant -- quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant -- even for simple families of distributions like exponential families with rich enough sufficient statistics -- score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.

Paper Structure (33 sections, 19 theorems, 123 equations, 5 figures)

This paper contains 33 sections, 19 theorems, 123 equations, 5 figures.

Key Result

Proposition 1

The log-Sobolev inequality for $q$ is equivalent to the following inequality over all smooth probability densities $p$: $\mathop{\bf KL}(p,q) \le C_{LS}(q) (J_p(q) - J_p(p))$. More generally, for a class of distributions $\mathcal{P}$, the restricted log-Sobolev constant $C_{LS}(q,\mathcal{P})$ is the smallest constant such that $\mathop{\bf KL}(p,q) \le C_{LS}(q,\mathcal{P}) (J_p(q) - J_p(p))$ for all distributions $p \in \mathcal{P}$.
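
To read the inequality in Proposition 1, it may help to unpack its right-hand side. The following is a sketch under an assumption: the paper's Definition 1 is not reproduced in this summary, so $J_p(q)$ is taken here to be the score matching loss in Hyvärinen's integrated-by-parts form, with boundary terms assumed to vanish.

$$
J_p(q) = \mathbb{E}_{x \sim p}\left[\tfrac{1}{2}\|\nabla_x \log q(x)\|^2 + \Delta_x \log q(x)\right]
= \tfrac{1}{2}\,\mathbb{E}_{x \sim p}\|\nabla_x \log p(x) - \nabla_x \log q(x)\|^2 - \tfrac{1}{2}\,\mathbb{E}_{x \sim p}\|\nabla_x \log p(x)\|^2 .
$$

The last term does not depend on $q$, so under this assumption

$$
J_p(q) - J_p(p) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p}\|\nabla_x \log p(x) - \nabla_x \log q(x)\|^2 ,
$$

which is half the relative Fisher information of $p$ with respect to $q$. Read this way, the proposition says that a small excess score matching loss forces a small KL divergence, at a rate set by the (restricted) log-Sobolev constant.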

Figures (5)

  • Figure 1: Statistical efficiency of score matching vs MLE for fitting the distribution with ground truth parameters $(\theta_0, \theta_1) = (1,0)$ of the form $p_{\theta} (x) \propto e^{\theta_0 (x^2 - x^4 / (2 a^2)) + \theta_1 (x^2 - x^4 / (2 a^2) + \text{erf}(x))}$ as we vary the offset $a$ between 1 and 7 and train with fixed number of samples ($10^5$). We see score matching (red) performs very poorly compared to the MLE (blue) as the offset (distance between modes) grows, by plotting the log of the Euclidean distance to the true parameter for both estimators.
  • Figure 2: Level sets for the distribution over estimates in the same example as Figure 1. We see that as the distance $a$ between modes increases, the direction of large variance for the score matching estimator (right figure) corresponds to the difference of the sufficient statistics which encodes the sparse cut in the distribution. On the other hand, the MLE (left figure) does not exhibit this behavior and has low variance in all directions.
  • Figure 3: Here we see the result of running an identical experiment to Figure 1, only we remove the second sufficient statistic, so our distribution is now $p_{\theta} (x) \propto e^{\theta_0 (x^2 - x^4 / (2 a^2))}$ where $\theta_0 = 1$ and we again vary the offset $a$ between 1 and 7. With only the single sufficient statistic, score matching performs comparably to MLE.
  • Figure 4: Training a single hidden-layer network to score match a mixture of Gaussians (ground truth orange, score matching output blue) succeeds at learning the distribution when the modes are close (left, small isoperimetric constant), but not when they are distant (right, large isoperimetric constant) in which case it weighs the modes incorrectly.
  • Figure 5: Score matching vs MLE for a distribution with a rapidly oscillating sufficient statistic, $p_{\theta} (x) \propto e^{- \theta_0 x^2 / 2 - \theta_1 \sin(\omega x)}$ where $(\theta_0, \theta_1) = (1, 1)$, and increasing $\omega$. On the top, for increasing $\omega$ we show a log-log plot of the average Euclidean distance in parameter space between $\theta$ and the output of each estimator. On the bottom, for each value of $\omega$, we draw a level set of the distribution within which a fixed fraction of returned estimates lie (MLE left, score matching right). Score matching becomes increasingly inaccurate as $\omega$ increases while the MLE stays extremely accurate.
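
For exponential families $p_\theta(x) \propto e^{\langle \theta, F(x) \rangle}$ the score matching objective is quadratic in $\theta$, so the estimator can be written in closed form as $\hat\theta = -\big(\hat{\mathbb{E}}[F'(x) F'(x)^\top]\big)^{-1} \hat{\mathbb{E}}[F''(x)]$ in one dimension. The sketch below applies this to the family from Figure 5, with $F(x) = (-x^2/2, -\sin(\omega x))$. It is not the authors' experimental code: the function names `sample_target` and `score_matching_estimate`, the rejection sampler, the sample size, and the values of $\omega$ are all assumptions chosen for illustration.

```python
import numpy as np


def sample_target(n, omega, rng):
    """Rejection-sample n draws from p(x) ∝ exp(-x^2/2 - sin(omega * x)).

    Proposal: standard normal. Since exp(-sin(.)) <= e, a proposal x is
    accepted with probability exp(-sin(omega * x) - 1).
    (Illustrative sampler; not taken from the paper.)
    """
    chunks, total = [], 0
    while total < n:
        x = rng.standard_normal(4 * n)
        accept = rng.random(4 * n) < np.exp(-np.sin(omega * x) - 1.0)
        kept = x[accept]
        chunks.append(kept)
        total += kept.size
    return np.concatenate(chunks)[:n]


def score_matching_estimate(x, omega):
    """Closed-form score matching estimate of (theta_0, theta_1).

    With F(x) = (-x^2/2, -sin(omega x)), the empirical loss is
    0.5 * theta^T A theta + theta^T b, where A = mean(F' F'^T) and
    b = mean(F''), so the minimizer is -A^{-1} b.
    """
    dF = np.stack([-x, -omega * np.cos(omega * x)], axis=1)
    d2F = np.stack([-np.ones_like(x), omega**2 * np.sin(omega * x)], axis=1)
    A = dF.T @ dF / x.size
    b = d2F.mean(axis=0)
    return -np.linalg.solve(A, b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # True parameters are (1, 1) by construction of the sampler.
    for omega in (1.0, 4.0, 16.0):
        x = sample_target(100_000, omega, rng)
        print(omega, score_matching_estimate(x, omega))
```

Repeating the estimate over many independent samples and comparing its spread against a maximum likelihood fit would be one way to reproduce the kind of comparison the caption describes.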

Theorems & Definitions (49)

  • Definition 1: Score matching
  • Definition 2
  • Definition 3
  • Definition 4
  • Proposition 1
  • proof
  • Remark 1: Interpretation of Score Matching
  • Remark 2
  • Theorem 1
  • proof
  • ...and 39 more