High-Dimensional Canonical Correlation Analysis

Anna Bykhovskaya; Vadim Gorin

TL;DR

This paper addresses identifiability in high-dimensional canonical correlation analysis (CCA). It shows that when the sample size $S$ and the dimensions $K$ and $M$ grow proportionally, the classical sample CCA vectors are not consistent estimators of the population canonical vectors. It derives exact formulas for the estimation error, expressed as the angle of a cone around the true direction. The behavior exhibits a phase transition governed by the condition $\rho^2>\rho_c^2$, where $\rho_c^2=\frac{1}{\sqrt{(\tau_M-1)(\tau_K-1)}}$, and $z_\rho$ gives the location of the spike outside the bulk support of the Wachter distribution. The results extend beyond Gaussian data to non-Gaussian data under fourth-moment assumptions, to correlated signals and noise, and to multiple signals, and they are illustrated empirically on financial stock data and ecological grassland data. The practical outcome is a concrete procedure for assessing the precision of CCA estimates in high-dimensional applications and a framework for detecting and quantifying shared structure when the dimensions are comparable to the sample size.
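As a rough numerical illustration of the threshold, the sketch below evaluates $\rho_c^2$ for the sizes quoted in the Figure 1 caption ($K=1000$, $M=1500$, $S=8000$). The summary does not define $\tau_K$ and $\tau_M$; the code assumes they are the aspect ratios $S/K$ and $S/M$, and the function name `critical_rho_squared` is hypothetical. Comparing the result with $r^2=0.49$ further assumes that $r^2$ plays the role of $\rho^2$.

```python
import math


def critical_rho_squared(S: int, K: int, M: int) -> float:
    """Evaluate rho_c^2 = 1 / sqrt((tau_M - 1) * (tau_K - 1)).

    Assumption (not stated in the summary): tau_K = S / K and tau_M = S / M.
    """
    tau_K = S / K
    tau_M = S / M
    if tau_K <= 1.0 or tau_M <= 1.0:
        # The square root is only real when both ratios exceed one.
        raise ValueError("this sketch requires S > K and S > M")
    return 1.0 / math.sqrt((tau_M - 1.0) * (tau_K - 1.0))


# Sizes taken from the Figure 1 caption.
rho_c2 = critical_rho_squared(S=8000, K=1000, M=1500)
signal_r2 = 0.49  # squared canonical correlation quoted in the Figure 1 caption

print(f"rho_c^2 = {rho_c2:.4f}")
print("above threshold:", signal_r2 > rho_c2)
```

Under these assumptions the value $0.49$ lies above the threshold, which agrees with the single visible spike described in the Figure 1 caption.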

Abstract

This paper studies high-dimensional canonical correlation analysis (CCA) with an emphasis on the vectors that define canonical variables. The paper shows that when two dimensions of data grow to infinity jointly and proportionally, the classical CCA procedure for estimating those vectors fails to deliver a consistent estimate. This provides the first result on the impossibility of identification of canonical variables in the CCA procedure when all dimensions are large. As a countermeasure, the paper derives the magnitude of the estimation error, which can be used in practice to assess the precision of CCA estimates. Applications of the results to cyclical vs. non-cyclical stocks and to a limestone grassland data set are provided.

Paper Structure (32 sections, 24 theorems, 112 equations, 15 figures, 1 table)

Introduction
Background
High-dimensional CCA setup
Other related literature
CCA vs. sparse and regularized CCA
CCA vs. PCA and factor models
CCA vs. $F$-matrix
Outline of the paper
Basic framework
Population setting
Sample setting
Results
Implications of Theorem 2.5
General framework
Non-Gaussian data
...and 17 more sections

Key Result

Lemma 2.3

The number $r^2$ equals the single nonzero eigenvalue of the $K\times K$ matrix $(\mathbb E \mathbf u \mathbf u^\mathsf T)^{-1} (\mathbb E\mathbf u \mathbf v^\mathsf T) (\mathbb E\mathbf v \mathbf v^\mathsf T)^{-1} (\mathbb E\mathbf v \mathbf u^\mathsf T)$ and the single nonzero eigenvalue of the …
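The $K\times K$ part of the lemma can be checked numerically on a toy population model. The sketch below is an illustration under assumptions, not the paper's construction: it takes $\mathbf u = A\boldsymbol\xi$ and $\mathbf v = B\boldsymbol\eta$ with identity-covariance latent vectors whose first coordinates have correlation $r$, so that exactly one canonical correlation is nonzero. The sizes, the matrices `A` and `B`, and the value of `r` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, r = 4, 6, 0.7  # hypothetical small sizes and canonical correlation

# Assumed latent model: u = A xi, v = B eta, with Cov(xi) = I, Cov(eta) = I,
# and only the first coordinates correlated: Cov(xi[0], eta[0]) = r.
A = np.eye(K) + 0.1 * rng.standard_normal((K, K))  # well-conditioned mixing
B = np.eye(M) + 0.1 * rng.standard_normal((M, M))
C = np.zeros((K, M))
C[0, 0] = r

Suu = A @ A.T      # E[u u^T]
Svv = B @ B.T      # E[v v^T]
Suv = A @ C @ B.T  # E[u v^T]

# K x K matrix from Lemma 2.3: (E uu^T)^{-1} (E uv^T) (E vv^T)^{-1} (E vu^T)
T = np.linalg.solve(Suu, Suv) @ np.linalg.solve(Svv, Suv.T)

# Eigenvalues sorted by magnitude, largest first.
eigs = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
print("eigenvalues:", eigs)
print("r^2        :", r**2)
```

In this model the matrix is similar to $r^2\,\mathbf e_1\mathbf e_1^\mathsf T$, so one eigenvalue equals $r^2$ and the remaining $K-1$ vanish up to rounding.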

Figures (15)

Figure 1: Illustration of Theorem 2.5: Histogram of the squared sample canonical correlations from one simulation with $K=1000$, $M=1500$, $S=8000$, $r^2=0.49$. We observe a single spike in the correlations at approximately the location $z_\rho$. The density of the Wachter distribution with the corresponding parameters is shown in orange.
Figure 2: Illustration of Eqs. \eqref{eq_zrho}, \eqref{eq_sx}, \eqref{eq_sy} for $K=1000,\,M=1500,\,S=8000$.
Figure 3: Comparison of theoretical and simulated results for ${K=500}$, ${M=2500}$, $S=8000$. Simulated curves are based on one simulation from a fixed value $\rho^2$.
Figure 4: The estimated canonical variable belongs to a cone whose axis is the true direction shown by the blue arrow. If $\sin^2\theta$ is small, then the cone is narrow, as shown in purple; if $\sin^2\theta$ is large, then the cone is wide, as shown in yellow.
Figure 5: Angles: theoretical (black solid) and simulated (blue dotted) between $\widehat{\mathbf x}$ and $\mathbf x$ and simulated (red dashed) between $\widehat{\boldsymbol\alpha}$ and $\boldsymbol\alpha$ for different covariance matrices. ${K=500}$, ${M=2500}$, $S=8000$. Simulated curves are based on one simulation from a fixed value $\rho^2$.
...and 10 more figures
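
The experiment described in the Figure 1 caption can be imitated at a smaller scale. The sketch below is a hedged illustration, not the authors' simulation code: it keeps the same ratios but uses the hypothetical sizes $K=100$, $M=150$, $S=800$, draws Gaussian data with one correlated coordinate pair ($r^2=0.49$), and computes the squared sample canonical correlations from orthonormal bases of the two row spaces. The function name `squared_sample_canonical_correlations` and the data-generating model are assumptions made for illustration, and no mean centering is applied.

```python
import numpy as np


def squared_sample_canonical_correlations(U, V):
    """Squared sample canonical correlations of U (K x S) and V (M x S).

    Uses the fact that the canonical correlations are the singular values
    of Qu^T Qv, where Qu and Qv are orthonormal bases of the row spaces.
    Returned in descending order.
    """
    Qu, _ = np.linalg.qr(U.T)  # S x K, orthonormal columns
    Qv, _ = np.linalg.qr(V.T)  # S x M, orthonormal columns
    s = np.linalg.svd(Qu.T @ Qv, compute_uv=False)
    return np.sort(s**2)[::-1]


rng = np.random.default_rng(1)
K, M, S, r2 = 100, 150, 800, 0.49  # scaled-down, hypothetical sizes

U = rng.standard_normal((K, S))
V = rng.standard_normal((M, S))
# Correlate the first coordinates so the population squared canonical
# correlation is r2, while every coordinate keeps unit variance.
V[0] = np.sqrt(r2) * U[0] + np.sqrt(1.0 - r2) * V[0]

lam = squared_sample_canonical_correlations(U, V)
print("largest three:", lam[:3])
print("gap between first and second:", lam[0] - lam[1])
```

A histogram of `lam` should show a bulk plus one separated value, mirroring the single spike described for Figure 1. The largest value is typically noticeably larger than the population value $0.49$.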

Theorems & Definitions (49)

Example 2.1
Definition 2.2
Lemma 2.3
Definition 2.4
Theorem 2.5
Remark 2.6
Definition 3.1
Theorem 3.2
Theorem 3.3
Theorem 3.4
...and 39 more
