Fast distance computation of multivariate distributions via nonparanormal transport

Edward Shao, Junyoung Park, Naresh Punjabi, Hui Jiang, Irina Gaynanova

TL;DR

This paper introduces the Nonparanormal Transport (NPT) metric, a closed-form distance based on the flexible nonparanormal distribution family for modeling skewed and non-Gaussian multivariate data. NPT maintains a high level of agreement with the Wasserstein distance.

Abstract

With the increasing availability of data objects in the form of probability distributions, there is a growing need for statistical methods tailored to distributional data. Distance measures, especially the pairwise distance matrix between data objects, provide the foundation for a wide range of modern data analysis methods, such as clustering, multidimensional scaling, and distance-based regression. The Wasserstein distance is commonly used with distributional data due to its compelling optimal transport property. However, while the Wasserstein distance can be efficiently computed for univariate distributions, its application to multivariate distributions is limited by high computational costs. To address these scalability issues, we introduce the Nonparanormal Transport (NPT) metric, a closed-form distance based on the flexible nonparanormal distribution family for modeling skewed and non-Gaussian multivariate data. Simulation studies demonstrate that NPT maintains a high level of agreement with the Wasserstein distance, while being at least 1000 times faster than efficient variants of the Wasserstein distance when computing a 100-distribution pairwise distance matrix in both 2 and 5 dimensions. We illustrate the utility of NPT through a multidimensional scaling analysis of bivariate oxygen desaturation distributions of 723 individuals with sleep apnea in the Sleep Heart Health Study.
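To make the univariate case concrete, here is a minimal sketch of why the Wasserstein distance is cheap in one dimension: for two empirical samples of equal size, the optimal coupling matches sorted values, so the 2-Wasserstein distance reduces to a sort and a mean. The function names `wasserstein2_1d` and `pairwise_distance_matrix` and the toy Gaussian samples are assumptions made for illustration. This is not the NPT metric and is not taken from the paper's code.

```python
import numpy as np

def wasserstein2_1d(x, y):
    """Empirical 2-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan pairs order statistics,
    so sorting both samples is enough.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.sqrt(np.mean((x - y) ** 2)))

def pairwise_distance_matrix(samples, dist):
    """Fill the symmetric N x N matrix using N(N-1)/2 distance evaluations."""
    n = len(samples)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(samples[i], samples[j])
    return D

# Toy data (hypothetical): three univariate samples with shifted means.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, scale=1.0, size=100) for m in (0.0, 1.0, 3.0)]
D = pairwise_distance_matrix(samples, wasserstein2_1d)
```

The double loop shows where the cost of a pairwise matrix comes from: every one of the N(N-1)/2 pairs needs its own distance evaluation, so an expensive multivariate distance is multiplied by that count.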

Paper Structure

This paper contains 17 sections, 1 theorem, 22 equations, 9 figures, 2 tables.

Key Result

Proposition 2.1

The distance $d_{NPT}$ defines a metric on the space of nonparanormal distributions with finite second moments.
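For reference, the statement that $d_{NPT}$ defines a metric means it satisfies the standard metric axioms. The symbols $P$, $Q$, $R$ below are notation introduced here for arbitrary nonparanormal distributions with finite second moments; the paper's own proof may organize these properties differently.

```latex
\begin{align*}
  &\text{(non-negativity)}            && d_{NPT}(P, Q) \ge 0, \\
  &\text{(identity of indiscernibles)} && d_{NPT}(P, Q) = 0 \iff P = Q, \\
  &\text{(symmetry)}                  && d_{NPT}(P, Q) = d_{NPT}(Q, P), \\
  &\text{(triangle inequality)}       && d_{NPT}(P, R) \le d_{NPT}(P, Q) + d_{NPT}(Q, R).
\end{align*}
```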

Figures (9)

  • Figure 1: Absolute error comparison for $d=2$. Differences $d^2_{\text{Wass}} - d^2_{\text{method}}$ are evaluated across $N(N-1)/2$ pairwise distances ($N=100$) with all methods based on $n=100$ realizations compared against the Wasserstein distance on 1000 realizations (treated as ground truth).
  • Figure 2: Absolute error comparison for $d=5$. Differences $d^2_{\text{Wass}} - d^2_{\text{method}}$ are evaluated across $N(N-1)/2$ pairwise distances ($N=100$) with all methods based on $n=100$ realizations compared against the Wasserstein distance on 1000 realizations (treated as ground truth).
  • Figure 3: Scatter plots with overlaid kernel density estimates of bivariate measurements (duration and depth) of desaturation events for three SHHS subjects. Values in parentheses denote the corresponding ODI for each subject.
  • Figure 4: Scatterplot comparing distances to the reference Wasserstein distance with correlation coefficients for SHHS data. The horizontal axis represents the Wasserstein distance, while the vertical axis displays the estimated distance for each approximation. Points falling on the dashed diagonal identity line ($y=x$) indicate a perfect match between the approximation and the Wasserstein distance.
  • Figure 5: Scatterplots of the first two coordinates from a 2-dimensional MDS for NPT (top) and Wasserstein (bottom) distance metrics on SHHS data. Hollow circles represent individual distributions and are color-coded by ODI.
  • ...and 4 more figures
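
The multidimensional scaling (MDS) step behind a plot like Figure 5 can be sketched from any pairwise distance matrix. The block below is a minimal classical (Torgerson) MDS in NumPy. Classical MDS is an assumption here, since the source does not say which MDS variant was used. The function name `classical_mds` and the three toy points are hypothetical.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n objects in k dimensions from an n x n distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # eigh returns eigenvalues in ascending order; keep the k largest.
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:k]
    # Clip small negative eigenvalues caused by non-Euclidean input or round-off.
    lam = np.clip(evals[idx], 0.0, None)
    return evecs[:, idx] * np.sqrt(lam)

# Toy check (hypothetical data): three planar points are recovered
# up to rotation, reflection, and translation.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords = classical_mds(D, k=2)
```

Because the embedding depends only on the distance matrix, the same routine can be applied to a matrix of NPT distances or of Wasserstein distances, which is what makes a side-by-side comparison of the two embeddings possible.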

Theorems & Definitions (4)

  • Definition 2.1: Nonparanormal Distribution (Liu et al., 2009)
  • Definition 2.2: NPT
  • Proposition 2.1
  • Proof of Proposition 2.1
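
As background for Definition 2.1, the nonparanormal family is commonly stated as follows in Liu et al. (2009). This is a sketch of that standard definition. The symbols $X$, $f_j$, $\mu$, $\Sigma$ are assumptions about notation, and the paper's Definition 2.1 may differ in notation or in the regularity conditions it places on the transformations.

```latex
A random vector $X = (X_1, \dots, X_d)^\top$ follows a nonparanormal
distribution if there exist monotone, differentiable functions
$f_1, \dots, f_d$ such that
\[
  f(X) = \bigl(f_1(X_1), \dots, f_d(X_d)\bigr)^\top \sim N(\mu, \Sigma).
\]
```

Each coordinate is transformed separately, so the marginals can be skewed or otherwise non-Gaussian while the dependence structure is carried by the Gaussian covariance $\Sigma$.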