PCA recovery thresholds in low-rank matrix inference with sparse noise

Urte Adomaityte, Gabriele Sicuro, Pierpaolo Vivo

TL;DR

This paper addresses the recovery of a rank-one signal corrupted by sparse symmetric noise, where the noise matrix is constructed from a configuration-model graph. The authors develop a replica analysis yielding recursive distributional equations that determine the typical top eigenvalue, the distribution of the top-eigenvector components, and the spike–eigenvector overlap. They derive a sharp recovery threshold $\theta_{\rm crit}$ that generalises the BBP transition to sparse noise and provide explicit formulas for Poissonian and random-regular degree distributions; in the dense-connectivity limit, the classical BBP results are recovered. Numerical diagonalisation corroborates the theoretical predictions and illustrates the phase transition between the non-recovery and recovery regimes. The results are relevant to PCA-like recovery under sparse noise, including sparse PCA and related spectral methods.

Abstract

We study the high-dimensional inference of a rank-one signal corrupted by sparse noise. The noise is modelled as the adjacency matrix of a weighted undirected graph with finite average connectivity in the large size limit. Using the replica method from statistical physics, we analytically compute the typical value of the top eigenvalue, the top eigenvector component density, and the overlap between the signal vector and the top eigenvector. The solution is given in terms of recursive distributional equations for auxiliary probability density functions which can be efficiently solved using a population dynamics algorithm. Specialising the noise matrix to Poissonian and Random Regular degree distributions, the critical signal strength is analytically identified at which a transition happens for the recovery of the signal via the top eigenvector, thus generalising the celebrated BBP transition to the sparse noise case. In the large-connectivity limit, known results for dense noise are recovered. Analytical results are in agreement with numerical diagonalisation of large matrices.
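As a rough numerical illustration of the setup described in the abstract, the sketch below builds a sparse noise matrix and measures the overlap between the top eigenvector and the signal. It rests on several assumptions that are not taken from the paper: an Erdős–Rényi graph $G(N, c/N)$ stands in for a Poissonian configuration-model graph, all edge weights are set to one, the spike is added as $(\theta/N)\,\boldsymbol{x}\boldsymbol{x}^\intercal$, and the function names are hypothetical. It is a direct-diagonalisation check only, not the replica or population dynamics computation.

```python
import numpy as np

def poisson_graph_adjacency(N, c, rng):
    """Symmetric 0/1 adjacency matrix of an Erdos-Renyi graph G(N, c/N).

    Assumption: used here as a simple stand-in for a Poissonian
    configuration-model graph with average connectivity c.
    """
    upper = np.triu(rng.random((N, N)) < c / N, k=1)  # independent edges, i < j
    J = upper.astype(float)
    return J + J.T  # symmetrise; the diagonal stays zero

def top_eigenvalue_and_overlap(A, x):
    """Top eigenvalue of A and squared overlap of its eigenvector with x."""
    vals, vecs = np.linalg.eigh(A)      # eigenvalues in ascending order
    v = vecs[:, -1]                     # unit-norm top eigenvector
    return vals[-1], (v @ x) ** 2 / (x @ x)

rng = np.random.default_rng(1)
N, c = 1000, 4.0
J = poisson_graph_adjacency(N, c, rng)
x = rng.standard_normal(N)              # signal components x_i ~ N(0, 1)
mean_degree = J.sum() / N

summary = {}
for theta in (0.0, 6.0):
    # Assumed additive form of the spiked matrix: A = J + (theta / N) x x^T
    A = J + (theta / N) * np.outer(x, x)
    summary[theta] = top_eigenvalue_and_overlap(A, x)
    lam, q = summary[theta]
    print(f"theta={theta}: top eigenvalue={lam:.3f}, squared overlap={q:.3f}")
```

With no signal the top eigenvector should be essentially uncorrelated with $\boldsymbol{x}$, while for a sufficiently strong signal the squared overlap should become of order one; the precise threshold is what the paper's analysis determines.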

Paper Structure

This paper contains 22 sections, 123 equations, 7 figures.

Figures (7)

  • Figure 1: The BBP-like transition for the matrix model of Eq. \ref{eq:definition_intro}, where ${\mathbf{J}}$ is a (dense) GOE matrix constructed as ${\mathbf{J}}=(\mathbf G+\mathbf G^\intercal)/\sqrt{2N}$ with $G_{ij}\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1)$ and spike components $x_i\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1)$. In this case, the recovery threshold is known to be $\theta_{\mathrm{crit}}=1$ \cite{Peche06,Capitaine09,Bloemendal_2012}. (Left) For signal strength $\theta\leq\theta_{\mathrm{crit}}$, as $N\to \infty$, the top eigenvalue of ${\mathbf{A}}$ sits at the right edge of the limiting spectral density (LSD) $\varrho(\lambda)=\frac{1}{2\pi}\sqrt{4-\lambda^2}$, $\lambda\in[-2,2]$, of ${\mathbf{J}}$. (Right) For $\theta>\theta_{\mathrm{crit}}$, the top eigenvalue of ${\mathbf{A}}$ is an outlier at $\lambda_{\rm top}=\theta+\frac{1}{\theta}$ \cite{Peche06,Capitaine09,Bloemendal_2012}. The histogram represents the spectrum of a matrix of size $N=3\times 10^3$ for $\theta=0$ (left) and $\theta=2$ (right).
  • Figure 2: Eigenvalue density of the matrix ${\mathbf{A}}$ of size $N=2000$ as defined in Eq. \ref{eq:definition_intro}, with ${\mathbf{J}}$ a pure adjacency matrix of a Poissonian graph with average connectivity $c=4$ and signal $x_i \sim \mathcal{N}(0,1)$. Matrices are generated using an unbiased configuration model algorithm that can be found in \cite{CoolenAnnibale17}. (Left) Density for $\theta=0$, corresponding to ${\mathbf{A}}={\mathbf{J}}$, where the structural noise eigenvalue $\lambda_{\theta=0}$ is the sole outlier (diamond and dashed line). (Center) Density for $0<\theta<\theta_\mathrm{crit}$: the eigenvalue associated with the signal $\lambda_\theta$ (triangle and solid line) is either an outlier such that $\lambda_\theta<\lambda_{\theta=0}$, or is buried in the bulk, and the top eigenvector does not correlate with the signal vector ${\boldsymbol{x}}$. (Right) For $\theta>\theta_\mathrm{crit}$, the eigenvalue associated with the signal $\lambda_\theta$ is the top eigenvalue, and the top eigenvector correlates non-trivially with the signal vector ${\boldsymbol{x}}$.
  • Figure 3: Poissonian setup with $k_{\mathrm{max}}=20$, $c=4$, $\varrho_W(W)=\delta(W-1)$ and signal $x_i \sim \mathcal{N}(0,1)$. All simulation results are averaged over direct diagonalisation of $25$ matrices ${\mathbf{A}}$ of size $N = 10^3$ (left) and $50$ matrices of size $N = 2\times10^3$ (right), with standard deviation error bars. Theoretical predictions are obtained via the population dynamics algorithm described in Appendix \ref{app:popdyn} with population size $N_p=2\times10^5$. (Left) Average top eigenvalue (dots) and second top eigenvalue (crosses) from direct diagonalisation for various signal strengths $\theta$. For $\theta\leq\theta_{\mathrm{crit}}$, $\mathbb E[\lambda_{\rm top}]$ coincides with the structural top eigenvalue $\lambda_{\theta=0}$ of the symmetric noise matrix ${\mathbf{J}}$ (orange solid line). For $\theta>\theta_{\mathrm{crit}}$, the top eigenvalue correlates with the signal and its average takes the value $\lambda_{\theta}$, as predicted by Eq. \ref{eq:gen_lambda_signal} (red dotted line). The squared overlap in Eq. \ref{eq:gen_overlap} (blue dashed line) also matches the simulation results (diamonds) and exhibits a discontinuity at $\theta_{\mathrm{crit}}$. (Right) Heatmap of the average overlap in the $(\theta,c)$ parameter space obtained via numerical diagonalisation. The transition value $\theta_{\mathrm{crit}}$ of Eq. \ref{eq:gen_transition} (black dashed line) clearly separates the non-recovery and recovery phases.
  • Figure 4: Poissonian setup with $k_{\mathrm{max}}=20$, $c=4$, $\varrho_W(W)=\delta(W-1)$ and signal $x_i \sim \mathcal{N}(0,1)$ of strength $\theta=6$. The marginal cumulative distribution functions of $h$ and $\omega$ are obtained via population dynamics with population size $N_p=2\times 10^5$. (Left) The marginal cumulative distribution function of the single-site bias fields $h$. (Right) The marginal cumulative distribution function of the single-site inverse variances $\omega$, where the largest value of $\omega$ corresponds to the degree $k=1$ and is equal to $\mathbb E_{{\mathbf{A}}} [\lambda_{\rm top}]=\lambda_\theta$.
  • Figure 5: Poissonian setup with $k_{\mathrm{max}}=20$, $c=4$ and $\varrho_W(W)=\delta(W-1)$ for ${\mathbf{J}}$, and signal $x_i \sim \mathcal{N}(0,1)$ of strength $\theta=6>\theta_{\mathrm{crit}}$. Plots are obtained via the population dynamics algorithm described in Appendix \ref{app:popdyn} with population size $N_p=2\times10^5$. (Top left) The top eigenvector component density obtained via Eq. \ref{eq:top_eig_comp_density} (blue line) and direct diagonalisation (crosses). We also plot the standard Gaussian pdf (dashed line), which would be expected in case of full recovery, as a guide for the eye. (Top right) Overlap component density obtained via Eq. \ref{eq:overlap_comp_density} (blue line) and direct diagonalisation (crosses). (Bottom left) Degree decomposition of the top eigenvector component cumulative distribution function (CDF) in Eq. \ref{eq:top_eig_comp_density}. (Bottom right) Degree decomposition of the overlap component CDF in Eq. \ref{eq:overlap_comp_density}.
  • ...and 2 more figures
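
The dense benchmark in the Figure 1 caption can be checked numerically with a short sketch. The GOE construction ${\mathbf{J}}=(\mathbf G+\mathbf G^\intercal)/\sqrt{2N}$ and the values $\theta_{\mathrm{crit}}=1$ and $\lambda_{\rm top}=\theta+1/\theta$ come from the caption; the additive form ${\mathbf{A}}={\mathbf{J}}+(\theta/N)\,\boldsymbol{x}\boldsymbol{x}^\intercal$ is an assumption made here because the defining equation is not reproduced in this summary, and the function names are hypothetical. A smaller size than in the figure is used to keep the run short.

```python
import numpy as np

def spiked_goe(N, theta, rng):
    """Dense GOE noise plus a rank-one spike (assumed additive form)."""
    G = rng.standard_normal((N, N))
    J = (G + G.T) / np.sqrt(2 * N)       # semicircle bulk on [-2, 2]
    x = rng.standard_normal(N)           # spike components x_i ~ N(0, 1)
    A = J + (theta / N) * np.outer(x, x)  # assumption: spike strength theta
    return A, x

def top_eigenvalue_and_overlap(A, x):
    """Top eigenvalue of A and squared overlap of its eigenvector with x."""
    vals, vecs = np.linalg.eigh(A)       # eigenvalues in ascending order
    v = vecs[:, -1]                      # unit-norm top eigenvector
    return vals[-1], (v @ x) ** 2 / (x @ x)

rng = np.random.default_rng(0)
N = 1000
results = {}
for theta in (0.0, 2.0):
    A, x = spiked_goe(N, theta, rng)
    results[theta] = top_eigenvalue_and_overlap(A, x)
    lam, q = results[theta]
    # Reference value from the caption: bulk edge 2 below threshold,
    # outlier theta + 1/theta above it.
    reference = 2.0 if theta <= 1.0 else theta + 1.0 / theta
    print(f"theta={theta}: top eigenvalue={lam:.3f} "
          f"(reference {reference:.3f}), squared overlap={q:.3f}")
```

At finite $N$ the top eigenvalue only approximates the limiting values, so agreement is expected up to finite-size fluctuations.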

Theorems & Definitions (4)

  • Remark 3.1: Degree decomposition
  • Remark 3.2: Interpretation of $q$
  • Remark 3.3: Reduction to the pure noise matrix
  • Remark 3.4: Dense limit