Discrete distributions are learnable from metastable samples

Abhijith Jayakumar, Andrey Y. Lokhov, Sidhant Misra, Marc Vuffray

TL;DR

The paper tackles learning the stationary distribution of a high-dimensional discrete system when the observed data come from metastable regions in which a Markov chain mixes slowly. It introduces two notions of metastability, including $\eta$-strong metastability, and proves that the single-variable conditionals of strongly metastable states are close, in average TV distance, to those of the true stationary distribution. Leveraging this, the authors show that conditional-likelihood-based learning, particularly pseudo-likelihood (PL), can recover near-optimal estimates of the energy function and model parameters from metastable data, with explicit bounds that depend on the chain's conductance and flip-bound parameters. They provide concrete results for Ising models and demonstrate the approach numerically on the Curie-Weiss model, where PL learns the correct parameters from metastable samples whereas maximum likelihood estimation (MLE) fails. The work bridges statistical physics and learning theory, enabling robust learning in slow-mixing regimes and suggesting extensions to broader energy-based models and neural parametrizations.

Abstract

Physically motivated stochastic dynamics are often used to sample from high-dimensional distributions. However, such dynamics often get stuck in specific regions of their state space and mix very slowly to the desired stationary state. This causes such systems to approximately sample from a metastable distribution, which is usually quite different from the desired stationary distribution of the dynamics. We rigorously show that, in the case of multi-variable discrete distributions, the true model describing the stationary distribution can be recovered from samples produced from a metastable distribution under minimal assumptions about the system. This follows from a fundamental observation that the single-variable conditionals of metastable distributions that satisfy a strong metastability condition are on average close to those of the stationary distribution. This holds even when the metastable distribution differs considerably from the true model in terms of global metrics like Kullback-Leibler divergence or total variation distance. This property allows us to learn the true model using a conditional-likelihood-based estimator, even when the samples come from a metastable distribution concentrated in a small region of the state space. Explicit examples of such metastable states can be constructed from regions that effectively bottleneck the probability flow and cause poor mixing of the Markov chain. For specific cases of binary pairwise undirected graphical models (i.e., Ising models), we extend our results to further rigorously show that data coming from metastable states can be used to learn the parameters of the energy function and recover the structure of the model.

Paper Structure

This paper contains 31 sections, 21 theorems, 182 equations, 5 figures.

Key Result

Proposition 1

Starting from an $\eta$-metastable distribution $\nu$, it takes at least $\frac{|\mu - \nu|_{TV} - \epsilon}{\eta}$ steps to get $\epsilon$-close to the equilibrium distribution $\mu$ in TV distance.
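
One way to see where a bound of this form comes from is the short triangle-inequality argument below. It is a sketch under an assumption not stated in this summary: that $\eta$-metastability of $\nu$ means $|\nu P - \nu|_{TV} \le \eta$, where $P$ (notation introduced here) is the transition kernel of the chain with stationary distribution $\mu$. It also uses the standard fact that applying $P$ never increases TV distance, and it supposes that after $t$ steps the chain is $\epsilon$-close to equilibrium, i.e. $|\mu - \nu P^{t}|_{TV} \le \epsilon$.

```latex
\begin{aligned}
|\nu P^{t} - \nu|_{TV}
  &\le \sum_{s=0}^{t-1} |\nu P^{s+1} - \nu P^{s}|_{TV}
   \le t\,|\nu P - \nu|_{TV}
   \le t\,\eta, \\
|\mu - \nu|_{TV}
  &\le |\mu - \nu P^{t}|_{TV} + |\nu P^{t} - \nu|_{TV}
   \le \epsilon + t\,\eta
  \quad\Longrightarrow\quad
  t \ge \frac{|\mu - \nu|_{TV} - \epsilon}{\eta}.
\end{aligned}
```

For illustration only, with hypothetical values $|\mu - \nu|_{TV} = 0.9$, $\epsilon = 0.1$ and $\eta = 10^{-6}$, the bound gives $t \ge 8 \times 10^{5}$ steps.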

Figures (5)

  • Figure 1: An informal representation of our result given by Theorem 1. Samples coming from a metastable distribution of a reversible Markov chain sampler are far from the full measure in global metrics. Surprisingly, at the same time we show that single-variable conditionals in metastable distributions are on average close to those of the true distribution. In the Curie-Weiss model, we will use an explicit construction to demonstrate that such metastable states correspond to the local minima of the free energy, which agrees with an intuitive statistical physics picture of metastability.
  • Figure 2: Strongly metastable states in the CW model. Here we plot the violation of the detailed balance condition as defined in \ref{eq:strong_def}, computed by projecting to the magnetization space as in \ref{eq:mag_reduction}. (a) Fourth-order approximation to the free energy at $m_0$. (b) Second-order approximation. (c) Truncated free energy as defined in \ref{eq:truncated_def}.
  • Figure 3: (a) Error in learning the Curie-Weiss model on 5000 spins. Samples here are produced by Glauber dynamics "stuck" at the positive minimum of the free energy. True parameters here are $J = 1.2$, $h = 0.04$. (b) The true distribution is highly biased towards negative magnetization, as seen from the free energy curve. There is a metastable distribution with positive magnetization that is highly suppressed in terms of probability. The empirical distributions of samples ($M=4\times10^9$) drawn by an exact sampler and by Glauber dynamics are overlaid on top of the free energy. This shows that the Markov chain is effectively stuck around the positive minimum.
  • Figure 4: Comparison of the loss function landscape for the CW model with true parameters $J = 1.2$, $h = 0.04$. These are plotted with $M= 2^{32}$ samples produced by Glauber dynamics "stuck" at the positive minimum of the free energy. (a) The negative log-likelihood computed from this data clearly has its minimum far from the true model. The sign of the magnetization is opposite to that of the true model. This is expected, as maximum likelihood tries to match the sufficient statistics of the data to the model. (b) The PL loss function has its minimum close to the true model and learns the magnetic field with the right sign.
  • Figure 5: Error in learning the model in \ref{eq:three_body_ferro} with $\beta=1.4$. (a) The average energy of the samples produced by Glauber dynamics is much higher than the true average energy of the model. This implies that the Markov chain is stuck in a metastable distribution. To compute the average energy of the model for $n>20$ we linearly interpolate from the exact sampling results. $M=10^6$ for these experiments. (b) The maximum error in the learned energy function parameters. Learning is done from samples produced by the Markov chain.
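
The contrast between maximum likelihood and pseudo-likelihood described in the Figure 4 caption can be checked on a small exact computation. The Python sketch below is an illustration, not the authors' code, and rests on several assumptions: the Curie-Weiss measure is written as $p(\sigma) \propto \exp(J M^2/(2n) + hM)$ with $M = \sum_i \sigma_i$; the sign of $h$ is chosen so that the stationary distribution favours negative magnetization (the paper's sign convention may differ); and the metastable distribution is modelled as the stationary distribution restricted to the positive well beyond the barrier, rather than by simulating Glauber dynamics. All variable and function names are hypothetical.

```python
import numpy as np
from math import lgamma

# Assumed convention (not taken from the paper):
# p(sigma) proportional to exp(J*M^2/(2n) + h*M), M = sum_i sigma_i.
# With h < 0 the negative-magnetisation well dominates the stationary measure.
n, J, h = 5000, 1.2, -0.04

k = np.arange(n + 1)          # number of up spins
M = 2 * k - n                 # total magnetisation for each k
log_binom = np.array([lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                      for i in range(n + 1)])
logw = log_binom + J * M.astype(float) ** 2 / (2 * n) + h * M


def normalise(lw):
    """Turn log-weights into probabilities, working in log space for stability."""
    w = np.exp(lw - np.max(lw))
    return w / w.sum()


mu = normalise(logw)          # exact stationary law of the magnetisation

# Local minimum of the log-weights between the two wells (the barrier).
d = np.diff(logw)
barrier = int(np.where((d[:-1] < 0) & (d[1:] > 0))[0][0]) + 1

# Illustrative metastable law: mu restricted to the positive well.
nu = normalise(np.where(k > barrier, logw, -np.inf))


def pl_score(prob, J, h):
    """Expected per-spin gradient of the log pseudo-likelihood w.r.t. (J, h)."""
    th_up = J * (M - 1) / n + h      # local field on an up spin (others sum to M - 1)
    th_dn = J * (M + 1) / n + h      # local field on a down spin (others sum to M + 1)
    r_up = (k / n) * (1.0 - np.tanh(th_up))         # sigma_i - tanh(field), sigma_i = +1
    r_dn = ((n - k) / n) * (-1.0 - np.tanh(th_dn))  # same residual for sigma_i = -1
    g_h = float(np.sum(prob * (r_up + r_dn)))
    g_J = float(np.sum(prob * (r_up * (M - 1) / n + r_dn * (M + 1) / n)))
    return g_J, g_h


tv = 0.5 * float(np.abs(mu - nu).sum())   # global distance between mu and nu
m_mu = float(np.sum(mu * M)) / n          # mean magnetisation per spin under mu
m_nu = float(np.sum(nu * M)) / n          # mean magnetisation per spin under nu
gJ_nu, gh_nu = pl_score(nu, J, h)         # PL score at the true parameters under nu

print("TV(mu, nu):", tv)
print("mean magnetisation, stationary vs metastable:", m_mu, m_nu)
print("likelihood score in h at true parameters:", m_nu - m_mu)
print("pseudo-likelihood score at true parameters:", gJ_nu, gh_nu)
```

Under these assumptions the two distributions are almost disjoint in TV distance and their mean magnetizations have opposite signs, so the likelihood gradient at the true parameters (a difference of moments) is far from zero. The pseudo-likelihood gradient, which depends only on single-spin conditionals, is zero up to a small boundary term at the barrier, so the true parameters remain an approximate stationary point of the convex PL objective.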

Theorems & Definitions (36)

  • Definition 1: Metastability
  • Proposition 1
  • Definition 2: Strong Metastability
  • Proposition 2
  • Definition 3: Conductance
  • Lemma 1: Cheeger bound [jerrum1988conductance]
  • Theorem 1: Conditionals of strong metastable distributions
  • Corollary 1
  • Theorem 2: Test error for logistic regression with metastable samples
  • Theorem 3: Learning pair-wise couplings
  • ...and 26 more