High-Dimensional Canonical Correlation Analysis
Anna Bykhovskaya, Vadim Gorin
TL;DR
This paper addresses identifiability in high-dimensional canonical correlation analysis (CCA). It shows that when the sample size $S$ and the dimensions $K,M$ grow proportionally, the classical sample CCA vectors are not consistent estimators of the population canonical vectors. It derives exact formulas for the estimation error, stated as angular cones around the population vectors. A phase transition occurs at the critical value $\rho_c^2=\frac{1}{\sqrt{(\tau_M-1)(\tau_K-1)}}$: a signal with squared canonical correlation $\rho^2>\rho_c^2$ produces a spike at location $z_\rho$ beyond the bulk Wachter support. The results extend beyond Gaussian data to non-Gaussian settings under fourth-moment conditions, to correlated signals and noise, and to multiple signals, and they are illustrated empirically on financial stock data and ecological grassland data. The practical outcome is a concrete procedure for assessing the precision of CCA estimates in high-dimensional applications, together with a framework for detecting and quantifying shared structure when the dimensions are comparable to the sample size.
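To make the threshold concrete, the following is a minimal sketch that evaluates $\rho_c^2$ for given sizes. It assumes the aspect ratios are defined as $\tau_K=S/K$ and $\tau_M=S/M$, which the summary above does not state explicitly, and the function names are hypothetical.

```python
import math


def critical_rho_sq(S: int, K: int, M: int) -> float:
    """Critical squared correlation rho_c^2 = 1 / sqrt((tau_M - 1)(tau_K - 1)).

    Assumption (not stated in the summary): tau_K = S / K and tau_M = S / M.
    """
    tau_K = S / K
    tau_M = S / M
    if tau_K <= 1 or tau_M <= 1:
        # The square root is only real and finite when both ratios exceed 1.
        raise ValueError("need S > K and S > M for the threshold to be defined")
    return 1.0 / math.sqrt((tau_M - 1.0) * (tau_K - 1.0))


def is_above_threshold(rho_sq: float, S: int, K: int, M: int) -> bool:
    """True when rho^2 exceeds the critical value rho_c^2."""
    return rho_sq > critical_rho_sq(S, K, M)


if __name__ == "__main__":
    # tau_K = 10, tau_M = 5, so rho_c^2 = 1 / sqrt(4 * 9) = 1/6
    print(critical_rho_sq(S=1000, K=100, M=200))
    print(is_above_threshold(0.25, S=1000, K=100, M=200))
```

Under this reading, larger samples relative to the dimensions lower the threshold, so weaker shared signals cross it.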
Abstract
This paper studies high-dimensional canonical correlation analysis (CCA) with an emphasis on the vectors that define canonical variables. The paper shows that when two dimensions of data grow to infinity jointly and proportionally, the classical CCA procedure for estimating those vectors fails to deliver a consistent estimate. This provides the first result on the impossibility of identification of canonical variables in the CCA procedure when all dimensions are large. As a countermeasure, the paper derives the magnitude of the estimation error, which can be used in practice to assess the precision of CCA estimates. Applications of the results to cyclical vs. non-cyclical stocks and to a limestone grassland data set are provided.
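The inconsistency described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes Gaussian data with identity covariance and a single correlated pair of coordinates, so that the population canonical vector for the first data set is the first coordinate vector. All function names and parameter values are hypothetical choices for illustration.

```python
import numpy as np


def simulate_data(S, K, M, rho, seed=0):
    """Draw X (K x S) and Y (M x S) with one shared signal of correlation rho.

    Assumption: all coordinates are standard normal and independent, except
    that the first row of Y has correlation rho with the first row of X.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((K, S))
    Y = rng.standard_normal((M, S))
    Y[0] = rho * X[0] + np.sqrt(1.0 - rho**2) * Y[0]
    return X, Y


def top_sample_cca_vector(X, Y):
    """Return the largest squared sample canonical correlation and its X-side vector."""
    S = X.shape[1]
    Sxx = X @ X.T / S
    Syy = Y @ Y.T / S
    Sxy = X @ Y.T / S
    # Eigenproblem for Sxx^{-1} Sxy Syy^{-1} Syx (a K x K matrix).
    A = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    eigvals, eigvecs = np.linalg.eig(A)
    top = np.argmax(eigvals.real)
    alpha = eigvecs[:, top].real
    return eigvals[top].real, alpha / np.linalg.norm(alpha)


def cos_sq_to_truth(S, K, M, rho, seed=0):
    """Squared cosine of the angle between the sample vector and the first coordinate vector."""
    X, Y = simulate_data(S, K, M, rho, seed)
    _, alpha = top_sample_cca_vector(X, Y)
    return float(alpha[0] ** 2)


if __name__ == "__main__":
    # Few dimensions relative to samples: the estimate is close to the truth.
    print("low-dimensional :", cos_sq_to_truth(S=2000, K=5, M=5, rho=0.8))
    # Dimensions proportional to samples: a visible angle remains.
    print("high-dimensional:", cos_sq_to_truth(S=400, K=100, M=100, rho=0.8))
```

With identity population covariance the angle can be measured in the ordinary Euclidean sense; for general covariances a different inner product would be needed, which this sketch does not attempt.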
