Fast Wasserstein rates for estimating probability distributions of probabilistic graphical models

Daniel Bartl, Stephan Eckstein

TL;DR

This work analyzes nonparametric distribution estimation in Wasserstein distance under known probabilistic graphical models. By quantifying smoothness of the conditional kernels via Wasserstein-Lipschitz (WLip) and TV-Lipschitz (TV-Lip) conditions, it shows that estimation rates depend on the local graph structure through the local dimension $d_{\rm loc}$ rather than the ambient dimension $d$. It introduces tractable estimators whose rates are $\lesssim n^{-1/d_{\rm loc}}$ under WLip and $\lesssim n^{-2/(2+d_{\rm loc})} + n^{-1/d_{\max}}$ under TV-Lip, with additional log factors in boundary cases; under TV-Lip the rate is shown to be sharp. The results identify when graph-based structural knowledge accelerates learning and establish fundamental limits when continuity is absent, informing both theory and practice for structured nonparametric estimation in graphical models.

Abstract

Using i.i.d. data to estimate a high-dimensional distribution in Wasserstein distance is a fundamental instance of the curse of dimensionality. We explore how structural knowledge about the data-generating process which gives rise to the distribution can be used to overcome this curse. More precisely, we work with the set of distributions of probabilistic graphical models for a known directed acyclic graph. It turns out that this knowledge is only helpful if it can be quantified, which we formalize via smoothness conditions on the transition kernels in the disintegration corresponding to the graph. In this case, we prove that the rate of estimation is governed by the local structure of the graph, more precisely by dimensions corresponding to single nodes together with their parent nodes. The precise rate depends on the exact notion of smoothness assumed for the kernels, where either weak (Wasserstein-Lipschitz) or strong (bidirectional Total-Variation-Lipschitz) conditions lead to different results. We prove sharpness under the strong condition and show that this condition is satisfied for example for distributions having a positive Lipschitz density.
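
To make the gap between ambient and local dimension concrete, the following is a minimal sketch rather than the paper's construction. It assumes that the local dimension $d_{\rm loc}$ is the largest combined dimension of a single node together with its parents (one reading of the abstract), ignores constants and log factors, and uses a hypothetical chain graph; the names `local_dimension` and `rate` are illustrative only.

```python
def local_dimension(parents, dims):
    """Largest combined dimension of a node and its parents.

    Assumption: this is one reading of the local dimension d_loc;
    the paper's exact definition may differ.
    """
    return max(dims[v] + sum(dims[p] for p in parents[v]) for v in dims)


def rate(n, dim):
    """Rate n^(-1/dim), ignoring constants and log factors."""
    return n ** (-1.0 / dim)


# Hypothetical example: a chain 1 -> 2 -> ... -> 10 of two-dimensional nodes.
dims = {v: 2 for v in range(1, 11)}
parents = {v: ([v - 1] if v > 1 else []) for v in dims}

d = sum(dims.values())                  # ambient dimension: 20
d_loc = local_dimension(parents, dims)  # node plus one parent: 4

n = 10**4
ambient_rate = rate(n, d)    # n^(-1/20), roughly 0.63
local_rate = rate(n, d_loc)  # n^(-1/4), roughly 0.1
```

Under these assumptions, ten thousand samples give a bound of order 0.1 when the chain structure is exploited, compared with roughly 0.63 when only the ambient dimension is used.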

Paper Structure

This paper contains 19 sections, 25 theorems, 146 equations, 2 figures.

Key Result

Theorem 1.3

Assume $G$ satisfies Assumption ass:graph_struc, fix $L > 0$ and denote by $\mathcal{Q}$ the set of measures $\mu \in \mathcal{P}_G$ which satisfy Assumption ass:W.Lip with constant $L$. Then, there exists a constant $C$ depending only on $G$, $L$ and $d_{\rm loc}$ such that

Figures (2)

  • Figure 1: Illustration of the constants occurring in Lemma \ref{lem:Wdecomp}. The red numbers indicate the number of outgoing paths of different lengths (e.g., $(2, 1)$ below node 3 indicates that there are 2 outgoing paths of length 1 and 1 outgoing path of length 2). The green numbers indicate how the constants for the cost change in the backward induction of the proof of Lemma \ref{lem:Wdecomp}. At the end of the backward induction (bottom right), the red numbers indicate the constants for each node, e.g., $2L+2L^2+L^3$ corresponds to $(2, 2, 1)$ for node 1.
  • Figure 2: Visualization of estimators for a simple graph $1 \rightarrow 2$ with $\mathcal{X}_1 = \mathcal{X}_2 = [0, 1]$ and a partition of each interval into three subsets. Shown are the support of $\hat{\mu}$ (left), the support of $\hat{\mu}^{\mathcal{A}}$ (middle) and the support of $\hat{\mu}^{b\mathcal{A}}$ (right). Here, blue crosses are the initial data points, green crosses are the new points added by making the kernels constant in the direction from the first to the second coordinate, and orange crosses are the new points added by further making the kernels constant in the direction from the second to the first coordinate. The result, on the right, is a measure that is locally a product measure on each cube.
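
The path counts described in the caption of Figure 1 can be reproduced with a short recursion. The sketch below uses a hypothetical five-node DAG, chosen only so that the counts match those quoted in the caption ($(2, 1)$ for node 3 and $(2, 2, 1)$ for node 1); it is not necessarily the graph drawn in the figure, and the names `children`, `path_counts` and `constant` are illustrative.

```python
from functools import lru_cache

# Hypothetical DAG given as child lists; an assumption chosen to
# reproduce the counts quoted in the caption, not the paper's graph.
children = {1: (2, 3), 2: (), 3: (4, 5), 4: (5,), 5: ()}


@lru_cache(maxsize=None)
def path_counts(node):
    """Return (c_1, c_2, ...), where c_k counts outgoing paths of length k."""
    counts = []
    for child in children[node]:
        # One path of length 1 (the edge itself), plus every outgoing
        # path of the child extended by that edge.
        extended = (1,) + path_counts(child)
        for k, c in enumerate(extended):
            if k == len(counts):
                counts.append(0)
            counts[k] += c
    return tuple(counts)


def constant(node, L):
    """Polynomial sum_k c_k * L^k built from the path counts of `node`."""
    return sum(c * L ** (k + 1) for k, c in enumerate(path_counts(node)))


print(path_counts(3))    # (2, 1)
print(path_counts(1))    # (2, 2, 1)
print(constant(1, 2.0))  # 2L + 2L^2 + L^3 at L = 2 gives 20.0
```

The recursion mirrors a backward pass over the graph: the counts of a node are assembled from the counts of its children, so leaves are resolved first. How the resulting polynomial enters the lemma itself is not stated in the caption.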

Theorems & Definitions (50)

  • Theorem 1.3
  • proof
  • Theorem 1.5
  • proof
  • Proposition 1.6
  • Definition 2.1
  • Theorem 2.2
  • Lemma 2.3
  • proof
  • Lemma 2.4
  • ...and 40 more