Computationally tractable nonparametric bootstrap of high-dimensional sample covariance matrices

Holger Dette, Angelika Rohde

Abstract

We introduce a new ``$(m,mp/n)$ out of $(n,p)$'' sampling-with-replacement bootstrap for eigenvalue statistics of high-dimensional sample covariance matrices based on $n$ independent $p$-dimensional random vectors. As it only uses $q=\lfloor mp/n\rfloor$ coordinates of the observations in a subsample of size $m \ll n$ from the original data, it is computationally tractable for large scale data. In the high-dimensional scenario $p/n\rightarrow c\in (0,\infty)$, this fully nonparametric bootstrap is shown to consistently reproduce the empirical spectral measure if $m/n\rightarrow 0$. If $m^2/n\rightarrow 0$, it approximates correctly the distribution of linear spectral statistics. The crucial component is a suitably defined Representative Subpopulation Condition which is shown to be verified in a large variety of situations. Our proofs are conducted under minimal moment requirements and incorporate delicate results on non-centered quadratic forms, combinatorial trace moments estimates as well as a conditional bootstrap martingale CLT which may be of independent interest.
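The resampling step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `subsample_bootstrap_eigenvalues` is hypothetical, and two choices are assumptions that the abstract does not specify: the $q$ coordinates are drawn uniformly at random without replacement, and the covariance matrix is formed without centering as $\frac{1}{m} X^{*\top} X^*$. The paper's Representative Subpopulation Condition may call for a different coordinate selection.

```python
import numpy as np


def subsample_bootstrap_eigenvalues(X, m, rng=None):
    """Eigenvalues of one "(m, mp/n) out of (n, p)" bootstrap covariance matrix.

    X   : (n, p) array, one observation per row.
    m   : bootstrap subsample size, intended to be much smaller than n.
    rng : seed or numpy Generator (illustrative parameter).
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    q = (m * p) // n  # q = floor(m p / n), so q / m stays close to p / n

    # Draw m observations with replacement.
    rows = rng.integers(0, n, size=m)
    # ASSUMPTION: coordinates chosen uniformly at random without replacement.
    cols = rng.choice(p, size=q, replace=False)

    X_star = X[np.ix_(rows, cols)]  # (m, q) bootstrap data matrix
    # ASSUMPTION: non-centered sample covariance of the bootstrap sample.
    sigma_star = X_star.T @ X_star / m  # (q, q)
    return np.linalg.eigvalsh(sigma_star)


# Usage on synthetic data with p / n = 1 / 2.
X = np.random.default_rng(0).standard_normal((200, 100))
eigenvalues = subsample_bootstrap_eigenvalues(X, m=40, rng=1)
```

Because the bootstrap matrix is only $q \times q$ rather than $p \times p$, the eigendecomposition works on a matrix smaller by a factor of about $n/m$ in each dimension, which is the source of the computational savings mentioned in the abstract.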

Paper Structure

This paper contains 35 sections, 21 theorems, 459 equations, 5 figures, 3 tables.

Key Result

Theorem 4.1

Grant assumptions (A1)--(A3). Assume that the Representative Subpopulation Condition (\ref{def: rsc}) is satisfied with $q=mp/n$. If $m = o(n)$, then

Figures (5)

  • Figure 1: Left panel: Eigenvalue histogram of an empirical covariance matrix from a bootstrap sample drawn randomly with replacement; Right panel: Eigenvalue histogram of an empirical covariance matrix from the bootstrap sample drawn by the $(m,mp/n)$ out of $(n,p)$ bootstrap proposed in this paper. Solid line (in both panels): the density of the limiting spectral distribution. The sample size is $n=80000$, the dimension is $p=40000$, and the population covariance matrix is a diagonal matrix with $50\%$ of the entries equal to $1$ and $50\%$ equal to $2$.
  • Figure 2: Histograms of eigenvalues of the empirical covariance matrix $\widehat{\Sigma}_n$ (upper left panel) and of the empirical covariance matrix $\widehat{\Sigma}_n^*$ obtained by "$(m,mp/n)$ out of $(n,p)$" bootstrap for different choices of $m$ (upper right panel: $m=n/5$; lower left panel: $m=n/10$; lower right panel: $m=n/20$). The sample size is $n=10000$ and the dimension is $p=5000$, and data is generated with the population covariance matrix (a) in \ref{sim1}.
  • Figure 3: Histograms of eigenvalues of the empirical covariance matrix $\widehat{\Sigma}_n$ (upper left panel) and of the empirical covariance matrix $\widehat{\Sigma}_n^*$ obtained by "$(m,mp/n)$ out of $(n,p)$" bootstrap for different choices of $m$ (upper right panel: $m=n/5$; lower left panel: $m=n/10$; lower right panel: $m=n/20$). The sample size is $n=10000$ and the dimension is $p=5000$, and data is generated with the population covariance matrix (b) in \ref{sim1}.
  • Figure 4: Histograms of eigenvalues of the empirical covariance matrix $\widehat{\Sigma}_n$ (upper left panel) and of the empirical covariance matrix $\widehat{\Sigma}_n^*$ obtained by "$(m,mp/n)$ out of $(n,p)$" bootstrap for different choices of $m$ (upper right panel: $m=n/5$; lower left panel: $m=n/10$; lower right panel: $m=n/20$). The sample size is $n=10000$ and the dimension is $p=5000$, and data is generated with the population covariance matrix (c) in \ref{sim1}.
  • Figure 5: Density of the limiting spectral distribution (solid line) and the histogram of the "$(m,mp/n)$ out of $(n,p)$" bootstrap. The sample size is $n=80000$ and the dimension is $p=20000$ (left column), $p=40000$ (middle column) and $p=60000$ (right column), where $m=n/10$. The upper and lower rows correspond to the different scenarios (a) and (b) for the population covariance matrix $\Sigma_n$ in \ref{sim1}.

Theorems & Definitions (49)

  • Remark 3.2: Stability under perturbations
  • Example 3.3: Diagonal covariance matrices
  • Example 3.4: Symmetric Toeplitz and block Toeplitz matrices
  • Example 3.5: Representative subpopulations
  • Theorem 4.1: Spectral distribution
  • Remark 4.2
  • Theorem 4.3: Extremal eigenvalues
  • Corollary 4.5
  • Theorem 4.6: Linear spectral statistics
  • Remark 4.7
  • ...and 39 more