High-Dimensional Partial Least Squares: Spectral Analysis and Fundamental Limitations

Victor Léger, Florent Chatelain

TL;DR

This work provides a rigorous high-dimensional theory for Partial Least Squares (PLS) in a two-dataset setting by modeling X and Y as a signal-plus-noise system with joint and individual components. Using random matrix theory, it derives deterministic equivalents for the cross-covariance resolvent, the limiting spectral distribution of the cross-covariance, and precise phase-transition thresholds for spike detectability. It also characterizes the alignment of PLS singular vectors with true signal directions, revealing fundamental limitations such as spurious alignment with individual components and noise-induced skewing of shared components, while proving PLS-SVD can outperform PCA in detecting shared latent structure. The results offer a comprehensive spectral perspective on PLS in high dimensions and motivate future work on filtering spurious spikes and extending the framework to other PLS variants.

Abstract

Partial Least Squares (PLS) is a widely used method for data integration, designed to extract latent components shared across paired high-dimensional datasets. Despite decades of practical success, a precise theoretical understanding of its behavior in high-dimensional regimes remains limited. In this paper, we study a data integration model in which two high-dimensional data matrices share a low-rank common latent structure while also containing individual-specific components. We analyze the singular vectors of the associated cross-covariance matrix using tools from random matrix theory and derive asymptotic characterizations of the alignment between estimated and true latent directions. These results provide a quantitative explanation of the reconstruction performance of the PLS variant based on Singular Value Decomposition (PLS-SVD) and identify regimes where the method exhibits counter-intuitive or limiting behavior. Building on this analysis, we compare PLS-SVD with principal component analysis applied separately to each dataset and show its asymptotic superiority in detecting the common latent subspace. Overall, our results offer a comprehensive theoretical understanding of high-dimensional PLS-SVD, clarifying both its advantages and fundamental limitations.
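As a rough illustration of the procedure described in the abstract, the following sketch computes the leading singular vectors of a sample cross-covariance matrix and measures their alignment with the true shared directions. It is a toy rank-one model, not the paper's model: the column-wise sample layout, the $1/n$ normalization, the Gaussian noise, and the names `pls_svd`, `simulate_rank_one`, and `amplitude` are all assumptions made for this example, and individual-specific components are omitted.

```python
import numpy as np


def pls_svd(X, Y, k=1):
    """Top-k singular triplets of the sample cross-covariance X Y^T / n.

    X is p x n and Y is q x n, with the n paired samples stored as columns.
    Both the layout and the 1/n normalization are assumptions of this sketch.
    """
    n = X.shape[1]
    S_xy = X @ Y.T / n
    U, s, Vt = np.linalg.svd(S_xy, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k].T


def simulate_rank_one(p, q, n, amplitude, rng):
    """Toy data: one shared latent factor plus i.i.d. Gaussian noise."""
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)          # true shared direction in X
    v = rng.standard_normal(q)
    v /= np.linalg.norm(v)          # true shared direction in Y
    t = rng.standard_normal(n)      # shared latent scores
    X = amplitude * np.outer(u, t) + rng.standard_normal((p, n))
    Y = amplitude * np.outer(v, t) + rng.standard_normal((q, n))
    return X, Y, u, v


rng = np.random.default_rng(0)
X, Y, u, v = simulate_rank_one(p=100, q=50, n=1000, amplitude=3.0, rng=rng)
U_hat, s_hat, V_hat = pls_svd(X, Y, k=1)

# Squared cosine between estimated and true directions (sign-invariant).
align_u = float((U_hat[:, 0] @ u) ** 2)
align_v = float((V_hat[:, 0] @ v) ** 2)
print(f"alignment in X: {align_u:.3f}, alignment in Y: {align_v:.3f}")
```

Lowering `amplitude` or increasing `p` and `q` relative to `n` should degrade the alignments, which is the kind of behavior the paper quantifies asymptotically.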

Paper Structure

This paper contains 45 sections, 17 theorems, 201 equations, 7 figures.

Key Result

Theorem 1

Some deterministic equivalents $\bar{\mathbf{Q}}$ of $\mathbf{Q}$, and $\bar{\tilde{\mathbf{Q}}}$ of $\tilde{\mathbf{Q}}$, are given by […], where $\tilde{m}(z) = \frac{q}{p}m(z) - \frac{1-\frac{q}{p}}{z}$, and $(z,m(z))$ is the unique solution in $\mathcal{Z}\left(\mathbb{C}\setminus\mathrm{Supp}(\mu)\right)$ of […]. The matrices $\bar{\mathbf{Q}}_{X}$ and $\bar{\mathbf{Q}}_{\dots}$ […]
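The relation between $\tilde{m}(z)$ and $m(z)$ in Theorem 1 has the same form as the exact finite-dimensional identity linking the Stieltjes transforms of the spectra of $\mathbf{A}^{\top}\mathbf{A}$ and $\mathbf{A}\mathbf{A}^{\top}$ for a $p \times q$ matrix $\mathbf{A}$: the two share their nonzero eigenvalues and differ only by $|p-q|$ zeros. The sketch below checks this identity numerically. It assumes that $m$ is attached to the $q \times q$ Gram matrix and $\tilde{m}$ to the $p \times p$ one, which this excerpt does not state; the matrix `A` and the point `z` are arbitrary choices for illustration, and the theorem itself concerns the high-dimensional limit rather than a single finite matrix.

```python
import numpy as np


def stieltjes(eigs, z):
    """Stieltjes transform of the empirical spectral measure at z."""
    return np.mean(1.0 / (eigs - z))


rng = np.random.default_rng(0)
p, q = 30, 80                       # hypothetical dimensions
A = rng.standard_normal((p, q))

eig_q = np.linalg.eigvalsh(A.T @ A)  # q x q Gram matrix, has q - p zero eigenvalues
eig_p = np.linalg.eigvalsh(A @ A.T)  # p x p Gram matrix, same nonzero eigenvalues

z = -1.0 + 0.5j                      # any point away from the spectrum
m = stieltjes(eig_q, z)
m_tilde = stieltjes(eig_p, z)

# Right-hand side of the relation quoted in Theorem 1.
rhs = (q / p) * m - (1 - q / p) / z
gap = abs(m_tilde - rhs)
print(f"|m_tilde - rhs| = {gap:.2e}")
```

The identity is purely algebraic, so it also holds when the roles of `p` and `q` are swapped.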

Figures (7)

  • Figure 1: Empirical distribution of the squared singular values of $\mathbf{S}_{XY}$ together with the limiting spectral distribution predicted by Proposition \ref{prop:lsd}, shown for different experimental settings. Left column: $\beta_p = 1/6$ and $\beta_q = 1/2$; middle column: $\beta_p = 50$ and $\beta_q = 2$; right column: $\beta_p=10/3$ and $\beta_q = 4$. From the first to the second row, the dimensions $p$, $q$, and $n$ are each multiplied by a factor of $5$.
  • Figure 2: (Left) Empirical distribution of the squared singular values of $\mathbf{S}_{XY}$, including the spikes generated by ${\mathbf{M}}$ and ${\mathbf{N}}$, together with the limiting spike locations $\xi_{M,k}$ and $\xi_{N,k}$ predicted by Proposition \ref{prop:isolated_ST}. (Right) Limiting alignments $\zeta_M$ (resp. $\zeta_N$) as functions of $\lambda_{M}$ (resp. $\lambda_N$), as predicted by Proposition \ref{prop:alignment_ST}, shown alongside the empirical alignments of the corresponding singular vectors (bars). Experimental settings: $\beta_p = 10$, $\beta_q = 2$ ($n = 8000$), $r_M=2$ with $\lambda_{M,1} = 25$ and $\lambda_{M,2} = 10$, $r_N=2$ with $\lambda_{N,1} = 35$ and $\lambda_{N,2} = 15$, $r=0$ (${\mathbf{T}}={\mathbf{0}}$, i.e., no common component).
  • Figure 3: Empirical means of the top singular vectors due to ${\mathbf{M}}$ and ${\mathbf{N}}$ that do not align with any deterministic component. (Left) Empirical mean of $\hat{{\mathbf{u}}}_{N,1}$, i.e., the top left singular vector associated with ${\mathbf{N}}$. (Right) Empirical mean of $\hat{{\mathbf{v}}}_{M,1}$, i.e., the top right singular vector associated with ${\mathbf{M}}$. Experimental settings: same as Fig. \ref{fig:spikes_ST}, with $1000$ Monte Carlo runs used to compute the empirical means.
  • Figure 4: (Left) Empirical distribution of the squared singular values of $\mathbf{S}_{XY}$ together with the limiting spike locations $\xi_{T,k}$ predicted by Proposition \ref{prop:isolated_PR}. (Right) Limiting alignments $\zeta_{P,k}$ and $\zeta_{R,k}$ as functions of $\lambda_{T,k}$, as predicted by Proposition \ref{prop:alignment_PR}, shown alongside the empirical alignments of the corresponding singular vectors (bars). Experimental settings: $\beta_p = 10$, $\beta_q = 2$ ($n = 8000$), $r=2$ with $\lambda_{P,1} = 25$, $\lambda_{P,2} = 10$, $\lambda_{R,1} = 3.5$ and $\lambda_{R,2} = 1.5$ ($\lambda_{T,1} = 84.6$, $\lambda_{T,2} = 36.6$, $\tilde{\lambda}_{P,1}=23.3$, $\tilde{\lambda}_{P,2}=11.7$, $\tilde{\lambda}_{R,1}=2.81$ and $\tilde{\lambda}_{R,2}=2.19$).
  • Figure 5: Spike locations and alignments for both specific and common components. The matrices ${\mathbf{M}}$ and ${\mathbf{P}}$ are generated from independent Gaussian variables, which naturally ensures that Assumption \ref{ass:nontriviality} is satisfied, even in the absence of strict orthogonality between the column spaces of ${\mathbf{M}}^{\top}$ and ${\mathbf{P}}$. (Left) Empirical distribution of the squared singular values of $\mathbf{S}_{XY}$ together with the limiting spike locations $\xi_{M,1}$ predicted by Proposition \ref{prop:isolated_ST} and $\xi_{T,1}$ predicted by Proposition \ref{prop:isolated_PR}. (Right) Limiting alignment $\zeta_{M,1}$ as a function of $\lambda_{M,1}$, and $\zeta_{P,1}$ and $\zeta_{R,1}$ as functions of $\lambda_{T,1}$, as predicted respectively by Propositions \ref{prop:alignment_ST} and \ref{prop:alignment_PR}, shown alongside the empirical alignments of the corresponding singular vectors (bars). Experimental settings: $\beta_p = 10$, $\beta_q = 2$ ($n = 8000$), $r_M=1$ with $\lambda_{M,1} = 20$, $r=1$ with $\lambda_{P,1} = 10$, $\lambda_{R,1} = 4$ ($\lambda_{T,1} = 54$, $\tilde{\lambda}_{P,1} = 10$, $\tilde{\lambda}_{R,1} = 4$).
  • ...and 2 more figures
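
The empirical side of an experiment like the one in Figure 1 can be sketched as follows: draw pure-noise data, form the cross-covariance, and collect its squared singular values at two scales with the same aspect ratios. This is only a sketch under assumptions. The normalization $\mathbf{S}_{XY} = \mathbf{X}\mathbf{Y}^{\top}/n$, the i.i.d. standard Gaussian noise, and the dimensions below are hypothetical choices, and the definitions of $\beta_p$ and $\beta_q$ are not given in this excerpt, so the code is parameterized by $p$, $q$, and $n$ directly. The limiting density from the paper is not reproduced here.

```python
import numpy as np


def squared_singular_values(p, q, n, rng):
    """Squared singular values of X Y^T / n for pure-noise X (p x n), Y (q x n)."""
    X = rng.standard_normal((p, n))
    Y = rng.standard_normal((q, n))
    s = np.linalg.svd(X @ Y.T / n, compute_uv=False)
    return s ** 2


rng = np.random.default_rng(0)

# Same aspect ratios at two scales: all dimensions multiplied by 5.
small = squared_singular_values(p=60, q=40, n=200, rng=rng)
large = squared_singular_values(p=300, q=200, n=1000, rng=rng)

# Normalized histograms approximate the spectral density at each scale.
bins = np.linspace(0.0, max(small.max(), large.max()), 21)
hist_small, _ = np.histogram(small, bins=bins, density=True)
hist_large, _ = np.histogram(large, bins=bins, density=True)
print(f"mean squared singular value: {small.mean():.3f} vs {large.mean():.3f}")
```

If the empirical spectrum converges to a deterministic limit, the two histograms should look increasingly alike as the dimensions grow at fixed ratios.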

Theorems & Definitions (27)

  • Remark 1: On the noise model
  • Definition 1: Deterministic Equivalent
  • Definition 2: Stieltjes Transform
  • Definition 3
  • Theorem 1: Deterministic equivalent
  • Remark 2: On the Marčenko–Pastur Stieltjes transform
  • Proposition 2: Limiting Singular Distribution of $\mathbf{S}_{XY}$
  • Remark 3: On the squared singular values
  • Remark 4: On the confinement of the spectrum
  • Lemma 1: Asymptotic orthogonality of the eigenspaces
  • ...and 17 more