Consistent Estimation of a Class of Distances Between Covariance Matrices

Roberto Pereira; Xavier Mestre; David Gregoratti

TL;DR

This paper tackles the problem of consistently estimating distances between covariance matrices directly from high-dimensional data by introducing a broad class of distances $d_M(1,2)=\sum_{l=1}^L \frac{1}{M}\mathrm{tr}[ f_1^{(l)}(\mathbf{R}_1) f_2^{(l)}(\mathbf{R}_2) ]$ that includes the Euclidean, log-Euclidean, and symmetrized KL distances. It develops a resolvent- and contour-based estimator $\hat{d}_M(1,2)$ that remains consistent as $M,N_j\to\infty$ with $c_j = M/N_j$ in various regimes, and proves a central limit theorem for the vector of distances with explicit asymptotic means and variances. The paper provides closed-form estimators for the Euclidean distance, KL divergence, and a computable LE distance, along with simplified single-integral expressions for their variances, enabling practical statistical inference. Numerical experiments show these consistent estimators outperform plug-in distances in high-dimensional settings and enable reliable clustering analyses based on covariance structure.
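To make the family of distances concrete, the following minimal sketch evaluates the three named members on known covariance matrices and checks that the Euclidean distance can be written as a sum of $L=3$ terms of the form $\frac{1}{M}\mathrm{tr}[f_1^{(l)}(\mathbf{R}_1) f_2^{(l)}(\mathbf{R}_2)]$. The function names and the exact normalizations (division by $M$, the factor $1/2$ and the offset $-1$ in the symmetrized KL divergence) are assumptions made for illustration and may differ from the paper's conventions. These are the distances themselves, not the consistent estimator $\hat{d}_M(1,2)$.

```python
import numpy as np

def _apply_to_spectrum(R, fun):
    """Apply a scalar function to a symmetric positive definite matrix via its eigenvalues."""
    w, V = np.linalg.eigh(R)
    return (V * fun(w)) @ V.T

def euclidean_distance(R1, R2):
    """(1/M) tr[(R1 - R2)^2] (normalization assumed)."""
    M = R1.shape[0]
    D = R1 - R2
    return np.trace(D @ D) / M

def euclidean_as_trace_sum(R1, R2):
    """Same quantity written as a sum of L = 3 terms (1/M) tr[f1(R1) f2(R2)]."""
    M = R1.shape[0]
    I = np.eye(M)
    # (f1(R1), f2(R2)) pairs: R1^2 * I, (-2 R1) * R2, I * R2^2
    terms = [(R1 @ R1, I), (-2.0 * R1, R2), (I, R2 @ R2)]
    return sum(np.trace(F1 @ F2) for F1, F2 in terms) / M

def symmetrized_kl(R1, R2):
    """(1/(2M)) tr[R1^{-1} R2 + R2^{-1} R1] - 1 (normalization assumed)."""
    M = R1.shape[0]
    t12 = np.trace(np.linalg.solve(R1, R2))
    t21 = np.trace(np.linalg.solve(R2, R1))
    return (t12 + t21) / (2.0 * M) - 1.0

def log_euclidean_distance(R1, R2):
    """(1/M) tr[(log R1 - log R2)^2] (normalization assumed)."""
    M = R1.shape[0]
    D = _apply_to_spectrum(R1, np.log) - _apply_to_spectrum(R2, np.log)
    return np.trace(D @ D) / M

# Toy example with two diagonal covariance matrices.
R1 = np.diag([1.0, 2.0])
R2 = np.diag([2.0, 1.0])
d_eu = euclidean_distance(R1, R2)
d_eu_sum = euclidean_as_trace_sum(R1, R2)
d_kl = symmetrized_kl(R1, R2)
d_le = log_euclidean_distance(R1, R2)
```

In practice $\mathbf{R}_1$ and $\mathbf{R}_2$ are unknown, which is why the paper estimates these quantities from data instead of evaluating them directly.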

Abstract

This work considers the problem of estimating the distance between two covariance matrices directly from the data. In particular, we are interested in the family of distances that can be expressed as sums of traces of functions that are separately applied to each covariance matrix. This family of distances is particularly useful as it takes into consideration the fact that covariance matrices lie in the Riemannian manifold of positive definite matrices, thereby including a variety of commonly used metrics, such as the Euclidean distance, Jeffreys' divergence, and the log-Euclidean distance. Moreover, a statistical analysis of the asymptotic behavior of this class of distance estimators is conducted. Specifically, we present a central limit theorem that establishes the asymptotic Gaussianity of these estimators and provides closed-form expressions for the corresponding means and variances. Empirical evaluations demonstrate the superiority of our proposed consistent estimator over conventional plug-in estimators in multivariate analytical contexts. Additionally, the central limit theorem derived in this study provides a robust statistical framework to assess the accuracy of these estimators.
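As an illustration of why plug-in estimators can be unreliable when the dimension is comparable to the sample size, the sketch below draws two independent zero-mean Gaussian data sets that share the same true covariance (the identity), so the true Euclidean distance is zero. The setup, the variable names, and the normalization $\frac{1}{M}\mathrm{tr}[(\hat{\mathbf{R}}_1-\hat{\mathbf{R}}_2)^2]$ are assumptions chosen for this example. Under these assumptions, standard Wishart moments give an expected plug-in value of $M/N_1 + M/N_2 + 1/N_1 + 1/N_2$, which does not vanish when $M/N_j$ stays fixed.

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance of zero-mean observations stored as the columns of X (M x N)."""
    return X @ X.T / X.shape[1]

def plugin_euclidean(X1, X2):
    """Plug-in Euclidean distance: sample covariances substituted for the true ones."""
    M = X1.shape[0]
    D = sample_covariance(X1) - sample_covariance(X2)
    return np.trace(D @ D) / M

rng = np.random.default_rng(0)
M, N1, N2 = 100, 200, 200            # hypothetical sizes with M/N1 = M/N2 = 0.5

# Both data sets have true covariance I, so the true distance is exactly 0.
X1 = rng.standard_normal((M, N1))
X2 = rng.standard_normal((M, N2))

d_plugin = plugin_euclidean(X1, X2)

# Expected plug-in value for identity covariance and Gaussian data.
expected_bias = M / N1 + M / N2 + 1.0 / N1 + 1.0 / N2

print(f"true distance: 0, plug-in: {d_plugin:.3f}, expected bias: {expected_bias:.3f}")
```

Increasing `M`, `N1`, and `N2` together at a fixed ratio leaves the offset essentially unchanged, which is the regime where a consistent estimator is needed.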

Paper Structure (29 sections, 6 theorems, 188 equations, 4 figures)

Introduction
Statistical Model of the Observations and Family of Distances
Consistent estimator of $d_{M}(1,2)$
Consistent Estimator
Estimation of the Euclidean distance
Estimation of the symmetrized KL divergence
Estimation of the Log-Euclidean distance
A central limit theorem on the proposed estimators
Particularization to the Euclidean distance
Particularization to the symmetrized KL divergence
Particularization to the log-Euclidean distance
Numerical Evaluation
Accuracy of the Asymptotic Distribution
Consistent Estimators vs Plug-in Distances
Assessing Clustering Quality
...and 14 more sections

Key Result

Proposition 1

Under (As1)-(As4) we have […] almost surely. Here, $\hat{h}_{j}^{(l)}(z)$ denotes the random function […], where $\hat{\omega}_{j}\left( z\right)$ denotes the consistent estimator of $\omega_{j}\left( z\right)$ given by […], and where $\hat{\omega}_{j}^{\prime}\left( z\right)$ represents its derivative, namely […]. Furthermore, the right-hand side of (eq:asymptEqF) has bounded spectral norm with probability o

Figures (4)

Figure 1: Histograms of the empirical distribution (in blue) and asymptotic descriptors (in orange) for the EU, KL, and LE metrics, arranged from top to bottom, for fixed $\rho_1 = 0.8$, $\rho_2 = 0.4$.
Figure 2: Relative MSE of the different metrics in scenarios (a)-(d) as a function of $N=N_1=N_2$ ($x$-axis). In all curves, the system dimension $M$ is scaled proportionally, so that $c = M/N$ is constant.
Figure 3: Empirical (solid lines) and theoretical (dashed lines) probability of correctly clustering six sample covariance matrices into three groups (y-axis) for growing $M$ (x-axis) and fixed $\rho_1 = \rho_2=0.3, \rho_3=\rho_4 = 0.5, \rho_5 = \rho_6 = 0.7$, using the proposed estimators.
Figure 4: Probability of correctly clustering six SCMs into three groups (y-axis) for growing $M$ (x-axis). Results for the traditional plug-in estimator are shown as dashed lines and those for the consistent estimator as solid lines.

Theorems & Definitions (9)

Remark 1
Remark 2
Proposition 1
Theorem 1
Remark 3
Lemma 1
Lemma 2
Lemma 3
Proposition 2
