High-Dimensional Partial Least Squares: Spectral Analysis and Fundamental Limitations
Victor Léger, Florent Chatelain
TL;DR
This work develops a rigorous high-dimensional theory for Partial Least Squares (PLS) in a two-dataset setting, modeling X and Y as a signal-plus-noise system with joint and individual components. Using random matrix theory, it derives deterministic equivalents for the cross-covariance resolvent, the limiting spectral distribution of the cross-covariance, and precise phase-transition thresholds for spike detectability. It also characterizes the alignment of PLS singular vectors with the true signal directions, revealing fundamental limitations such as spurious alignment with individual components and noise-induced skewing of shared components. It further proves that PLS-SVD can outperform PCA in detecting shared latent structure. Together, the results offer a comprehensive spectral perspective on PLS in high dimensions and motivate future work on filtering spurious spikes and extending the framework to other PLS variants.
Abstract
Partial Least Squares (PLS) is a widely used method for data integration, designed to extract latent components shared across paired high-dimensional datasets. Despite decades of practical success, a precise theoretical understanding of its behavior in high-dimensional regimes remains limited. In this paper, we study a data integration model in which two high-dimensional data matrices share a low-rank common latent structure while also containing individual-specific components. We analyze the singular vectors of the associated cross-covariance matrix using tools from random matrix theory and derive asymptotic characterizations of the alignment between estimated and true latent directions. These results provide a quantitative explanation of the reconstruction performance of the PLS variant based on Singular Value Decomposition (PLS-SVD) and identify regimes where the method exhibits counter-intuitive or limiting behavior. Building on this analysis, we compare PLS-SVD with principal component analysis applied separately to each dataset and show its asymptotic superiority in detecting the common latent subspace. Overall, our results offer a comprehensive theoretical understanding of high-dimensional PLS-SVD, clarifying both its advantages and fundamental limitations.
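As an illustration of the setting described above, the following is a minimal sketch, not the paper's exact model. It assumes a rank-one joint component, one individual component per dataset, i.i.d. Gaussian noise, and a 1/n normalization of the cross-covariance; the function names (`simulate_pair`, `pls_svd`, `pca_top`, `alignment`) and the strength parameters are hypothetical choices made for this example. PLS-SVD is taken here to mean the singular value decomposition of the empirical cross-covariance matrix, and alignment is measured as the squared inner product between an estimated and a true unit direction.

```python
import numpy as np


def simulate_pair(n, p, q, joint_strength, indiv_strength, rng):
    """Draw (X, Y) from an assumed signal-plus-noise model (illustrative only)."""
    # Orthonormal joint (u, v) and individual (a, b) directions.
    Qx, _ = np.linalg.qr(rng.standard_normal((p, 2)))
    Qy, _ = np.linalg.qr(rng.standard_normal((q, 2)))
    u, a = Qx[:, 0], Qx[:, 1]
    v, b = Qy[:, 0], Qy[:, 1]
    t = rng.standard_normal(n)   # latent variable shared by X and Y
    sx = rng.standard_normal(n)  # latent variable specific to X
    sy = rng.standard_normal(n)  # latent variable specific to Y
    X = (np.sqrt(joint_strength) * np.outer(t, u)
         + np.sqrt(indiv_strength) * np.outer(sx, a)
         + rng.standard_normal((n, p)))
    Y = (np.sqrt(joint_strength) * np.outer(t, v)
         + np.sqrt(indiv_strength) * np.outer(sy, b)
         + rng.standard_normal((n, q)))
    return X, Y, u, v


def pls_svd(X, Y, k=1):
    """Top-k singular triplets of the empirical cross-covariance X^T Y / n."""
    S = X.T @ Y / X.shape[0]
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k].T


def pca_top(X):
    """Leading principal direction of X (top right singular vector)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]


def alignment(estimate, truth):
    """Squared cosine between two directions (sign-invariant)."""
    num = float(estimate @ truth) ** 2
    return num / float((estimate @ estimate) * (truth @ truth))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Y, u, v = simulate_pair(n=400, p=200, q=150,
                               joint_strength=100.0, indiv_strength=4.0, rng=rng)
    U, s, V = pls_svd(X, Y, k=1)
    print("PLS-SVD alignment with u:", alignment(U[:, 0], u))
    print("PLS-SVD alignment with v:", alignment(V[:, 0], v))
    print("PCA (X only) alignment with u:", alignment(pca_top(X), u))
```

Varying `joint_strength` and `indiv_strength` relative to the ratios p/n and q/n gives an informal way to explore the regimes the abstract refers to, such as weak joint signals or dominant individual components; the precise thresholds are those derived in the paper, not anything this sketch computes.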
