Moment Expansions of the Energy Distance

Ian Langmore

TL;DR

The paper analyzes the squared energy distance ${\mathcal{D}^2}(X,Y)$ in the regime where the two distributions are close, showing that the mean difference $\mu$ typically dominates the loss, while covariance differences $\Delta$ enter at higher order through an averaged, dimension-dependent term. By expressing ${\mathcal{D}^2}$ through a Fourier-cumulant expansion and introducing a decay scale $\lambda$, the author derives a leading-moments expansion: a main $O(1/\lambda)$ term proportional to $\|\mu\|^2$ and a secondary $O(1/\lambda^3)$ term involving $\Delta$, $\mu$, and skew cumulants, plus a controlled remainder. Specializing to multivariate Gaussians gives explicit forms and shows that off-diagonal covariance contributions are suppressed by a factor of order $1/d$ under spherical symmetry, while the diagonal part contributes at order $O(d^{-1/2})$ to the mean term. The work also compares the energy-distance gradient with that of a standard covariance loss via a cosine similarity analysis, showing how dimension and correlation structure influence learning dynamics. Numerical experiments on Gaussian and non-Gaussian distributions confirm the leading-moments predictions and indicate the regimes in which the theory holds, offering practical guidance for using energy-distance-based losses in high-dimensional learning tasks.

Abstract

The energy distance is used to test distributional equality and as a loss function in machine learning. While ${\mathcal{D}^2}(X, Y)=0$ only when $X\sim Y$, the sensitivity to different moments is of practical importance. This work considers ${\mathcal{D}^2}(X, Y)$ in the case where the distributions are close. In this regime, ${\mathcal{D}^2}(X, Y)$ is more sensitive to differences in the means $\bar{X}-\bar{Y}$ than to differences in the covariances $\Delta$. This is due to the structure of the energy distance and is independent of dimension. The sensitivity to on- versus off-diagonal components of $\Delta$ is examined when $X$ and $Y$ are close to isotropic. Here a dimension-dependent averaging occurs and, in many cases, off-diagonal correlations contribute significantly less. Numerical results verify that these relationships hold even when the distributional assumptions are not strictly met.
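To make the quantity in the abstract concrete, here is a minimal sketch of a sample estimate of the squared energy distance. It assumes the standard definition $2\,\mathbb{E}\|X-Y\| - \mathbb{E}\|X-X'\| - \mathbb{E}\|Y-Y'\|$ with Euclidean norm; the paper's own normalization is not shown in this summary and may differ by a constant. The function names (`mean_pairwise_distance`, `energy_distance_sq`, `gaussian_sample`) and the perturbation sizes in the demo are illustrative choices, not taken from the paper.

```python
import math
import random


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (p, q) with p in a, q in b."""
    total = 0.0
    for p in a:
        for q in b:
            total += math.dist(p, q)
    return total / (len(a) * len(b))


def energy_distance_sq(x, y):
    """V-statistic estimate of 2 E|X-Y| - E|X-X'| - E|Y-Y'| (assumed definition)."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))


def gaussian_sample(n, d, mu1=0.0, scale=1.0, rng=None):
    """Draw n points from an isotropic Gaussian in d dimensions.

    mu1 shifts only the first coordinate; scale is the common standard
    deviation of every coordinate. Both are hypothetical knobs for the demo.
    """
    rng = rng or random.Random(0)
    points = []
    for _ in range(n):
        p = [rng.gauss(0.0, scale) for _ in range(d)]
        p[0] += mu1
        points.append(p)
    return points


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 300, 4
    y = gaussian_sample(n, d, rng=rng)
    x_mean = gaussian_sample(n, d, mu1=0.3, rng=rng)   # perturbed mean only
    x_cov = gaussian_sample(n, d, scale=1.1, rng=rng)  # perturbed variance only
    print("mean perturbation:      ", energy_distance_sq(x_mean, y))
    print("covariance perturbation:", energy_distance_sq(x_cov, y))
```

The V-statistic form is used because it is exactly zero when both arguments are the same sample and never negative, which keeps small values easy to interpret; it carries a bias of order $1/n$, so very small perturbations need large samples to be resolved.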

Paper Structure

This paper contains 10 sections, 4 theorems, 53 equations, 5 figures, 1 table.

Key Result

Proposition 3.1

Given (i)-(iii) of the leading-moments assumptions, ${\mathcal{D}^2}(X,Y)$ admits a leading-moments expansion. Under spherical symmetry, assumption (iv), the expansion takes a more explicit form in which one term measures the alignment between the difference of skewness and the difference of the means, since ${\mathbb{E}}\left\{ (X_i-{\bar{X}}_i)^2(X_j-{\bar{X}}_j) \right\}$ is zero if $X$ is symmetric about its mean.

Figures (5)

  • Figure 1: Sweep of skewness parameter: Here we show the probability density resulting from transforming a unit Normal by SinhArcsinh(skew), for skew$\in\left\{ 0, 0.05, 0.1, 0.2 \right\}$. When skew=0, the transformation is the identity, so the density is the same as the unit Normal. As skew increases, the mass is tilted to the right.
  • Figure 2: Gaussian distributions with small perturbations: Here $Y\sim\mathcal{N}(0, I_d)$, and $X\sim\mathcal{N}(\mu, C)$ is a small perturbation of $Y$. We compare sample ${\mathcal{D}^2}(X, Y)$ with the theoretical estimate from the multivariate-normal leading-moments expansion, for $d=16, 32, 64$. Top: Different values of $\mu_1$ lead to different clusters. Bottom: Fixing $\mu_1=0.06$, we see the effect of different covariance only.
  • Figure 3: Gaussian distributions with larger perturbations: Here $Y\sim\mathcal{N}(0, I_d)$, and $X\sim\mathcal{N}(\mu, C)$ is a larger perturbation of $Y$. We compare sample ${\mathcal{D}^2}(X, Y)$ with the theoretical estimate from the multivariate-normal leading-moments expansion, for $d=16, 32, 64$. Top: Different values of $\mu_1$ lead to different clusters. Bottom: Fixing $\mu_1=0.15$, we see the effect of different covariance only. $R^2$ values are lower here than when perturbations were small (Figure 2).
  • Figure 4: Non-Gaussian distributions with larger perturbations: Here $X$ and $Y$ are transformed Gaussians. Only $d=64$ is shown. We compare sample ${\mathcal{D}^2}(X, Y)$ with the best-fit regression given by the distance-regression equation.
  • Figure 5: Non-Gaussian distributions with larger perturbations, fixed $\mu_1$: Here $X$ and $Y$ are transformed Gaussians. Only $d=64$ is shown. We only show $\mu_1=0.023$ so the effect of covariance can be isolated. We compare sample ${\mathcal{D}^2}(X, Y)$ with the best-fit regression given by the distance-regression equation.
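The skewed densities described in Figure 1 can be sketched as follows. This assumes the common sinh-arcsinh form $y = \sinh(\operatorname{arcsinh}(x) + \text{skew})$ with unit tail weight, which reduces to the identity at skew = 0 as the caption states; the exact parameterization used in the paper is not shown in this summary, and the helper names are hypothetical.

```python
import math
import random


def sinh_arcsinh(x, skew):
    """Sinh-arcsinh transform with unit tail weight (assumed form).

    With skew == 0 this is the identity map; positive skew pushes every
    point to the right, by more in the right tail than in the left.
    """
    return math.sinh(math.asinh(x) + skew)


def skewed_samples(n, skew, seed=0):
    """Transform n unit-Normal draws; the same seed gives the same base draws."""
    rng = random.Random(seed)
    return [sinh_arcsinh(rng.gauss(0.0, 1.0), skew) for _ in range(n)]


if __name__ == "__main__":
    # Skew values taken from the Figure 1 caption.
    for skew in (0.0, 0.05, 0.1, 0.2):
        s = skewed_samples(10_000, skew)
        print(f"skew={skew}: sample mean = {sum(s) / len(s):.4f}")
```

Reusing one seed across skew values means each transformed sample is a pointwise shift of the same Normal draws, so differences between rows come from the transform rather than from sampling noise.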

Theorems & Definitions (7)

  • Proposition 3.1: Taylor expansion
  • Corollary 3.2
  • Lemma 3.3
  • Proof of Lemma 3.3
  • Lemma 3.4: Spherical integrals
  • Proof of Lemma 3.4
  • Proof of Proposition 3.1