A Lanczos-Based Algorithmic Approach for Spike Detection in Large Sample Covariance Matrices

Charbel Abi Younes, Xiucai Ding, Thomas Trogdon

TL;DR

This work tackles estimating the number of spikes in high-dimensional spiked covariance models without computing the full spectrum of the sample covariance matrix $W=YY^*$, where $Y=\Sigma^{1/2}X$. The authors develop a Lanczos-based framework that samples random directions on the sphere to form eigenvector spectral distributions (VESDs) and leverages a fixed-point Stieltjes transform expressed via a continued fraction derived from the Jacobi and Cholesky structure; this yields a robust estimator for the spiked spectral distribution and, in particular, the spike count by counting poles beyond the ASD support. They establish consistency and concentration results for the estimators under a random-matrix local-law regime, showing $O(N^{-1/2})$-level accuracy for edge estimates and pole locations with $n=O(\log N)$ Lanczos steps, and demonstrate computational efficiency on large-scale problems. Numerically, the method achieves accurate ASD density estimation and spike detection with substantial speedups over eigenvalue-based approaches, and it remains robust to various population covariances, making it attractive for large-scale high-dimensional inference. The work thus provides a scalable, theory-backed alternative for spike detection in modern data applications where eigen-decomposition is prohibitive.
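To make "counting poles beyond the ASD support" concrete, here is a minimal sketch, not the authors' estimator. It relies on two standard facts: the poles of a finite continued fraction built from a Jacobi matrix are that matrix's eigenvalues, and a Sturm-sequence (LDL^T inertia) recurrence counts the eigenvalues of a symmetric tridiagonal matrix on one side of a threshold without computing them. The function names, the toy Jacobi entries, and the use of the Marchenko-Pastur upper edge $(1+\sqrt{c})^2$ as the threshold (valid only when the non-spiked part of $\Sigma$ is the identity) are assumptions made for illustration; the paper estimates the edge from data.

```python
import math


def mp_upper_edge(c):
    """Upper edge of the Marchenko-Pastur law for aspect ratio c = N / M.

    Assumption for this sketch: the non-spiked population covariance is the identity.
    """
    return (1.0 + math.sqrt(c)) ** 2


def count_poles_above(alpha, beta, edge):
    """Count eigenvalues of the Jacobi matrix (alpha, beta) strictly above `edge`.

    alpha: diagonal entries (length n); beta: off-diagonal entries (length n - 1).
    The number of negative pivots in the LDL^T factorization of T - edge * I
    equals the number of eigenvalues of T below `edge` (Sylvester's law of inertia).
    """
    below = 0
    d = alpha[0] - edge
    for k in range(len(alpha)):
        if k > 0:
            d = alpha[k] - edge - beta[k - 1] ** 2 / d
        if d == 0.0:
            d = -1e-300  # nudge an exact zero pivot to avoid division by zero
        if d < 0:
            below += 1
    return len(alpha) - below


# Toy Jacobi matrix (made-up numbers, not taken from the paper): two eigenvalues
# near 1 and 2 inside the bulk and one isolated eigenvalue near 6.
alpha = [1.5, 1.5, 6.0]
beta = [0.5, 0.1]
n_spikes = count_poles_above(alpha, beta, mp_upper_edge(0.25))
print(n_spikes)
```

In practice a small margin above the edge would likely be needed, since Ritz values near the edge fluctuate at finite $N$; how to choose that margin is left open in this sketch.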

Abstract

We introduce a new approach for estimating the number of spikes in a general class of spiked covariance models without directly computing the eigenvalues of the sample covariance matrix. This approach is based on the Lanczos algorithm and the asymptotic properties of the associated Jacobi matrix and its Cholesky factorization. A key aspect of the analysis is interpreting the eigenvector spectral distribution as a perturbation of its asymptotic counterpart. The specific exponential-type asymptotics of the Jacobi matrix enables an efficient approximation of the Stieltjes transform of the asymptotic spectral distribution via a finite continued fraction. As a consequence, we also obtain estimates for the density of the asymptotic distribution and the location of outliers. We provide consistency guarantees for our proposed estimators, proving their convergence in the high-dimensional regime. We demonstrate that, when applied to standard spiked covariance models, our approach outperforms existing methods in computational efficiency and runtime, while still maintaining robustness to exotic population covariances.
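The two ingredients named in the abstract, the Lanczos algorithm and a finite continued fraction for the Stieltjes transform, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's algorithm: it uses plain Lanczos with full reorthogonalization, evaluates $e_1^\top (T_n - z)^{-1} e_1$ for the resulting Jacobi matrix $T_n$ (the Stieltjes transform of the $n$-point Gauss quadrature approximation of the VESD), and omits the asymptotic tail and Cholesky-based steps described in the paper. The names `lanczos`, `stieltjes_cf`, `apply_W`, the $1/M$ normalization of $W$, and the rank-one spike with a made-up strength are all hypothetical choices. Pure Python is used to keep the example dependency-free; a real implementation would use an array library.

```python
import math
import random


def lanczos(matvec, b, n):
    """Run up to n Lanczos steps from start vector b, with full reorthogonalization.

    Returns (alpha, beta): the diagonal and off-diagonal of the Jacobi matrix.
    """
    norm = math.sqrt(sum(x * x for x in b))
    basis, alpha, beta = [[x / norm for x in b]], [], []
    for k in range(n):
        w = matvec(basis[k])
        alpha.append(sum(wi * qi for wi, qi in zip(w, basis[k])))
        # Orthogonalize against every previous Lanczos vector.
        for v in basis:
            proj = sum(wi * vi for wi, vi in zip(w, v))
            w = [wi - proj * vi for wi, vi in zip(w, v)]
        if k == n - 1:
            break
        nrm = math.sqrt(sum(wi * wi for wi in w))
        if nrm < 1e-12:  # invariant subspace reached; stop early
            break
        beta.append(nrm)
        basis.append([wi / nrm for wi in w])
    return alpha, beta


def stieltjes_cf(alpha, beta, z):
    """Evaluate e_1^T (T - z)^{-1} e_1 as a finite continued fraction, bottom up."""
    tail = alpha[-1] - z
    for a, b in zip(reversed(alpha[:-1]), reversed(beta)):
        tail = a - z - b * b / tail
    return 1.0 / tail


random.seed(0)
N, M, spike = 20, 40, 9.0  # toy sizes and spike strength (made up)
# Y = Sigma^{1/2} X with Sigma = diag(spike, 1, ..., 1) and iid N(0, 1) entries in X.
Y = [[(math.sqrt(spike) if i == 0 else 1.0) * random.gauss(0.0, 1.0)
      for _ in range(M)] for i in range(N)]


def apply_W(v):
    """Compute W v = Y (Y^T v) / M without forming W (1/M scaling is an assumption)."""
    t = [sum(Y[i][j] * v[i] for i in range(N)) for j in range(M)]
    return [sum(Y[i][j] * t[j] for j in range(M)) / M for i in range(N)]


b = [random.gauss(0.0, 1.0) for _ in range(N)]  # random direction, normalized in lanczos
alpha, beta = lanczos(apply_W, b, 8)
m_hat = stieltjes_cf(alpha, beta, 2.0 + 0.1j)  # evaluate just above the real axis
print(m_hat)
```

Only matrix-vector products with $W$ are needed, which is why no eigendecomposition of $W$ appears anywhere in the sketch.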

Paper Structure

This paper contains 29 sections, 22 theorems, 186 equations, 8 figures, 3 tables, 10 algorithms.

Key Result

Theorem 1

Let $\widehat{m}_0(z)$ be the estimate after running our algorithm alg for $n = \mathrm{O}(\log N)$ iterations. Moreover, let $\widehat{\gamma}_{\pm}$ be estimates of the support endpoints and $\widehat{\gamma}_j$ ($j = 1, 2, \dots, \widehat{r}$) be the poles of $\widehat{m}_0(z)$ for $z > \widehat{

Figures (8)

  • Figure 1: Left: The ESD of $\Sigma$, where $\Sigma$ is an $N \times N$ diagonal matrix with $N = 6000$ and entries given by the quantiles of the density $\frac{1}{K}\frac{x^4+1}{x^2} \sqrt{x - 0.1} \sqrt{4 - x}$ with $K$ being a normalizing constant. The first three diagonal entries are modified to 7, 6, and 6, forming the spikes of $\Sigma$. Right: The ESD of $W$ defined in \ref{eq:SCM_Model}, with $c_N = 0.1$, $M = 60000$, and $X$ having iid standard normal entries. The ESD is compared against the estimated outliers, their locations, and the approximate ASD obtained using Algorithms \ref{Ea:ESD} and \ref{finaldetectionalgorithm} with parameters $k = 150$ and $C = 1$.
  • Figure 2: VESD of the sample covariance matrix in Example \ref{Ex:StdSpikedCov} for $\mathbf{b} = \mathbf{e}_1$ with $N = 10000$, $M = 20000$, and $c = 0.5$, compared to the VASD from \ref{eq:StdSpikedCovVASD} for different values of $\ell$.
  • Figure 3: The rows correspond to $c = 0.1$, $c = 0.5$, and $c = 0.9$, respectively. Left: Estimated outliers and ASD obtained using Algorithms \ref{Ea:ESD} and \ref{finaldetectionalgorithm} with $k = 100$ vectors. These estimates are compared to the ESD of $W$ from Simulation \ref{sim:Johnstone} and the MP density given in \ref{eq:MPdens}. Right: Estimated outlier locations and the corresponding error between the estimates and the true outliers of $W$.
  • Figure 4: Comparison of the estimated support and density from Algorithm \ref{Ea:ESD} with $k=100$ against the true counterparts from the MP law. The plots illustrate the convergence of errors in Simulation \ref{sim:Johnstone} for $c = 0.1$, $0.5$, and $0.9$ as $N$ increases. Errors are averaged over 50 trials, with each point representing the mean error and vertical error bars indicating the standard deviation.
  • Figure 5: Left: Spike detection accuracy in Simulation \ref{sim:Johnstone} for $c = 0.1$, $0.5$, and $0.9$ as $N$ varies, based on 50 sample realizations with the algorithm run using $k=1$. Right: Comparison of the average computation time for eigenvalue calculation and spike detection across the samples for $c=0.5$.
  • ...and 3 more figures

Theorems & Definitions (48)

  • Theorem: Informal
  • Theorem 3.1
  • proof
  • Lemma 3.2
  • Remark 3.3
  • Example 1
  • Example 2
  • Definition 4.1: Stochastic domination
  • Remark 4.2
  • Lemma 4.3
  • ...and 38 more