Consistent Estimation of a Class of Distances Between Covariance Matrices
Roberto Pereira, Xavier Mestre, David Gregoratti
TL;DR
This paper tackles the problem of consistently estimating distances between covariance matrices directly from high-dimensional data by introducing a broad class of distances $d_M(1,2)=\sum_{l=1}^L \frac{1}{M}\mathrm{tr}[ f_1^{(l)}(\mathbf{R}_1) f_2^{(l)}(\mathbf{R}_2) ]$ that includes the Euclidean, log-Euclidean, and symmetrized KL distances. It develops a resolvent- and contour-based estimator $\hat{d}_M(1,2)$ that remains consistent as $M,N_j\to\infty$ with $c_j = M/N_j$ in various regimes, and proves a central limit theorem for the vector of distances with explicit asymptotic means and variances. The paper provides closed-form estimators for the Euclidean distance, KL divergence, and a computable LE distance, along with simplified single-integral expressions for their variances, enabling practical statistical inference. Numerical experiments show these consistent estimators outperform plug-in distances in high-dimensional settings and enable reliable clustering analyses based on covariance structure.
Abstract
This work considers the problem of estimating the distance between two covariance matrices directly from the data. In particular, we are interested in the family of distances that can be expressed as sums of traces of functions that are separately applied to each covariance matrix. This family is particularly useful because it takes into account the fact that covariance matrices lie in the Riemannian manifold of positive definite matrices, and it therefore includes a variety of commonly used metrics, such as the Euclidean distance, Jeffreys' divergence, and the log-Euclidean distance. We also carry out a statistical analysis of the asymptotic behavior of this class of distance estimators. Specifically, we present a central limit theorem that establishes the asymptotic Gaussianity of these estimators and provides closed-form expressions for the corresponding means and variances. Empirical evaluations demonstrate the superiority of our proposed consistent estimator over conventional plug-in estimators in multivariate analytical contexts. Additionally, the central limit theorem derived in this study provides a robust statistical framework to assess the accuracy of these estimators.
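To make the "sums of traces of functions applied separately to each covariance matrix" concrete, the following is a minimal Python sketch of how the three named distances fit that form. It is an illustration under stated assumptions, not the paper's consistent estimator: it only evaluates the trace sum on whatever matrices it is given, so feeding it sample covariance matrices yields the conventional plug-in estimate. The helper names (`apply_spectral`, `trace_class_distance`, `sample_covariance`), the term lists, and the normalizations chosen for each distance (for example the factor 1/2 and the constant in the symmetrized KL divergence) are hypothetical choices made for this example.

```python
import numpy as np


def apply_spectral(f, R):
    """Apply a scalar function f to a symmetric matrix R via its eigenvalues."""
    w, V = np.linalg.eigh(R)
    return (V * f(w)) @ V.T  # V diag(f(w)) V^T


def trace_class_distance(R1, R2, terms):
    """Evaluate sum_l (1/M) tr[f1_l(R1) f2_l(R2)] for a list of (f1_l, f2_l) pairs."""
    M = R1.shape[0]
    total = sum(
        np.trace(apply_spectral(f1, R1) @ apply_spectral(f2, R2))
        for f1, f2 in terms
    )
    return total / M


def one(x):
    """Constant function 1, which maps any matrix to the identity."""
    return np.ones_like(x)


# Assumed normalization: (1/M) tr[(R1 - R2)^2], expanded into three trace terms.
EUCLIDEAN = [(lambda x: x**2, one), (lambda x: -2 * x, lambda x: x), (one, lambda x: x**2)]

# Assumed normalization: (1/(2M)) tr[R1^{-1} R2 + R1 R2^{-1}] - 1.
SYM_KL = [(lambda x: 0.5 / x, lambda x: x), (lambda x: 0.5 * x, lambda x: 1 / x), (lambda x: -one(x), one)]

# Assumed normalization: (1/M) tr[(log R1 - log R2)^2].
LOG_EUCLIDEAN = [
    (lambda x: np.log(x) ** 2, one),
    (lambda x: -2 * np.log(x), np.log),
    (one, lambda x: np.log(x) ** 2),
]


def sample_covariance(X):
    """Sample covariance of zero-mean observations stored as columns of X (M x N)."""
    return X @ X.T / X.shape[1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, N1, N2 = 20, 200, 200  # N > M so the sample covariances are invertible

    # Two illustrative true covariance matrices (hypothetical choices).
    R1 = np.eye(M)
    idx = np.arange(M)
    R2 = 0.5 ** np.abs(idx[:, None] - idx[None, :])

    # Zero-mean Gaussian samples with the covariances above.
    X1 = np.linalg.cholesky(R1) @ rng.standard_normal((M, N1))
    X2 = np.linalg.cholesky(R2) @ rng.standard_normal((M, N2))
    S1, S2 = sample_covariance(X1), sample_covariance(X2)

    for name, terms in [("Euclidean", EUCLIDEAN), ("sym. KL", SYM_KL), ("log-Euclidean", LOG_EUCLIDEAN)]:
        true_d = trace_class_distance(R1, R2, terms)
        plug_in = trace_class_distance(S1, S2, terms)
        print(f"{name}: true = {true_d:.4f}, plug-in = {plug_in:.4f}")
```

Running the script prints the true distance next to its plug-in counterpart for one random draw. The gap between the two when M is not small relative to N is the effect that the consistent estimators discussed above are designed to remove.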
