PCA++: How Uniformity Induces Robustness to Background Noise in Contrastive Learning

Mingqi Wu, Qiang Sun, Yi Yang

TL;DR

This work studies how to robustly recover a shared low-dimensional signal from paired high-dimensional data corrupted by structured background noise. It first analyzes alignment-only PCA (PCA+) and then introduces PCA++, a hard uniformity-constrained variant that reduces to a generalized eigenproblem and remains stable in high dimensions. The authors provide exact high-dimensional asymptotics for both the fixed-aspect-ratio and growing-spike regimes, demonstrating that explicit feature dispersion regularizes against background interference. Empirically, PCA++ outperforms standard PCA and PCA+ on simulations, corrupted MNIST, and single-cell RNA-seq data, and the theory clarifies uniformity's role as a robust regularizer in contrastive learning. Overall, the paper links uniformity to practical robustness, with implications for self-supervised and multiview learning in noisy, high-dimensional settings.

Abstract

High-dimensional data often contain low-dimensional signals obscured by structured background noise, which limits the effectiveness of standard PCA. Motivated by contrastive learning, we address the problem of recovering shared signal subspaces from positive pairs, that is, paired observations that share the same signal but differ in background. Our baseline, PCA+, uses alignment-only contrastive learning and succeeds when background variation is mild, but fails under strong noise or in high-dimensional regimes. To address this, we introduce PCA++, a hard uniformity-constrained contrastive PCA that enforces identity covariance on projected features. PCA++ has a closed-form solution via a generalized eigenproblem, remains stable in high dimensions, and provably regularizes against background interference. We provide exact high-dimensional asymptotics in both fixed-aspect-ratio and growing-spike regimes, showing uniformity's role in robust signal recovery. Empirically, PCA++ outperforms standard PCA and alignment-only PCA+ on simulations, corrupted-MNIST, and single-cell transcriptomics, reliably recovering condition-invariant structure. More broadly, we clarify uniformity's role in contrastive learning, showing that explicit feature dispersion defends against structured noise and enhances robustness.
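
The abstract describes PCA++ as alignment maximization under an identity-covariance constraint, solved as a generalized eigenproblem. The following is a minimal sketch of that idea, not the authors' reference implementation. It assumes zero-mean data with samples in rows, a symmetrized cross-covariance between the two views as the contrastive covariance, and the sample covariance of the first view as the constraint matrix. The names `simulate_pairs`, `pca_plus`, `pca_plus_plus`, and the `ridge` parameter are hypothetical and introduced only for illustration.

```python
import numpy as np


def simulate_pairs(n, d, signal=2.0, background=5.0, seed=0):
    """Hypothetical toy data: one shared signal direction (axis 0) and
    one background direction (axis 1) drawn independently per view."""
    rng = np.random.default_rng(seed)
    a = np.zeros(d)
    a[0] = signal
    b = np.zeros(d)
    b[1] = background
    w = rng.standard_normal((n, 1))  # shared signal coefficient

    def view():
        h = rng.standard_normal((n, 1))  # view-specific background coefficient
        return w * a + h * b + rng.standard_normal((n, d))

    return view(), view()


def contrastive_covariance(X, X_pos):
    """Symmetrized cross-covariance between paired views (assumed form)."""
    C = X.T @ X_pos / X.shape[0]
    return 0.5 * (C + C.T)


def pca_plus(X, X_pos, k):
    """Alignment-only baseline: top-k eigenvectors of the contrastive covariance."""
    _, vecs = np.linalg.eigh(contrastive_covariance(X, X_pos))
    return vecs[:, ::-1][:, :k]  # eigh returns ascending eigenvalues


def pca_plus_plus(X, X_pos, k, ridge=1e-8):
    """Maximize tr(V^T S+ V) subject to V^T S V = I, i.e. S+ v = lambda S v."""
    S_plus = contrastive_covariance(X, X_pos)
    S = X.T @ X / X.shape[0] + ridge * np.eye(X.shape[1])  # ridge is an assumption
    # Whiten with the Cholesky factor S = L L^T, then solve a standard problem.
    L = np.linalg.cholesky(S)
    M = np.linalg.solve(L, np.linalg.solve(L, S_plus).T).T  # L^{-1} S+ L^{-T}
    M = 0.5 * (M + M.T)  # remove rounding asymmetry
    _, U = np.linalg.eigh(M)
    V = np.linalg.solve(L.T, U)  # map back: V = L^{-T} U, so V^T S V = I
    return V[:, ::-1][:, :k]
```

Calling `pca_plus_plus(*simulate_pairs(n=5000, d=10), k=1)` returns a single direction that should point mostly along the signal axis in this toy setup. With SciPy available, `scipy.linalg.eigh(S_plus, S)` solves the same generalized problem directly; the Cholesky route is used here only to keep the sketch NumPy-only.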

Paper Structure

This paper contains 59 sections, 13 theorems, 113 equations, 6 figures, 14 tables, 1 algorithm.

Key Result

Theorem 3.1

Under Assumptions \ref{asm:orthogonal}--\ref{asm:noise}, the contrastive covariance estimator $S_{n}^{+}$ satisfies $\mathbb{E}\bigl[S_{n}^{+}\bigr] = A A^\top$.
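
As a rough numerical illustration of Theorem 3.1, the sketch below averages a contrastive covariance over many simulated pairs and compares it with $A A^\top$. Everything about the setup is an assumption made for this example: a linear model in which both views share the signal term $A w$ with $w$ of identity covariance, each view draws its own background and isotropic noise, and $S_{n}^{+}$ is taken to be the symmetrized cross-covariance of the two views. The paper's exact definitions and assumptions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 200_000

# Hypothetical loadings: signal on axis 0, background on axis 1.
A = np.zeros((d, 1))
A[0, 0] = 2.0
B = np.zeros((d, 1))
B[1, 0] = 3.0

w = rng.standard_normal((n, 1))  # signal coefficients shared by both views


def view():
    """One view: shared signal plus independent background and noise."""
    h = rng.standard_normal((n, 1))
    return w @ A.T + h @ B.T + rng.standard_normal((n, d))


X, X_pos = view(), view()

# Symmetrized cross-covariance (assumed form of the contrastive estimator).
C = X.T @ X_pos / n
S_plus = 0.5 * (C + C.T)

# Background and noise are independent across views, so their cross terms
# average out and only the shared signal term A A^T should remain.
max_dev = np.abs(S_plus - A @ A.T).max()
print(f"largest entrywise deviation from A A^T: {max_dev:.3f}")
```

A single large-sample average stands in for the expectation here, so the deviation is small but not exactly zero and shrinks as `n` grows.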

Figures (6)

  • Figure 1: Subspace estimation error for standard PCA, PCA+, and PCA++. Results are for Example \ref{ex:counterexample}. Left: (varying relative strength of the signal $\lambda_{A,1}/\sqrt{\lambda_{B,1}}$) As background strength grows, PCA+ deteriorates sharply while PCA++ keeps its error uniformly low. Right: (varying aspect ratio $d/n$) Across all regimes, PCA++ outperforms both PCA and PCA+.
  • Figure 2: Effect of covariance truncation on PCA++. Results are for Example \ref{ex:counterexample}. Left: As $d/n$ increases, truncated PCA++ remains stable and accurate while untruncated PCA++ deteriorates sharply. Right: Truncated PCA++ with varying truncation ranks $s$ (fixed $s=2$, or $s$ set to $0.1d$, $0.2d$, $0.4d$ for feature dimension $d$).
  • Figure 3: Empirical validation of theoretical predictions for PCA++. Left: Validation in the fixed-aspect-ratio regime for Theorem \ref{thm:dist}. Right: Validation in the growing-spike regime for Theorem \ref{thm:dist2}.
  • Figure 4: 2D embeddings of noisy digit-over-grass images. Standard PCA fails to separate classes, and contrastive PCA+ shows partial, misaligned separation. In contrast, PCA++ achieves clear class separation predominantly along its first eigenvector, highlighting its superior ability to separate the true signal from background noise.
  • Figure 5: PCA vs. PCA++ embeddings. We apply PCA and PCA++ to matched control and stimulated PBMCs (9,268 cells each) from the kang2018multiplexed dataset and visualize the top 50 components using UMAP. (a–c) show PCA embeddings of CD4 T cells, B cells, and NK cells, where control and stimulated cells are often separated despite minimal biological response. (d–f) show the same cells under PCA++, where alignment across conditions improves, highlighting PCA++’s ability to isolate stable, condition-invariant structure.
  • ...and 1 more figure

Theorems & Definitions (16)

  • Theorem 3.1: Unbiasedness of the contrastive covariance estimator
  • Theorem 3.2: Finite-sample performance of PCA+
  • Example 3.3: One‑signal, one‑background
  • Theorem 3.4: Failure under strong background
  • Theorem 4.2: Asymptotic subspace error under hard uniformity
  • Theorem 4.4
  • Definition A.1: Principal angles
  • Remark A.2
  • Lemma F.1: Contrastive energy of sample directions
  • Lemma F.2: Contrastive energy in growing-spike regime
  • ...and 6 more