Representation Learning with Blockwise Missingness and Signal Heterogeneity

Ziqi Liu, Ye Tian, Weijing Tang

TL;DR

This work tackles unified representation learning from multi-source data with structured blockwise missingness and blockwise signal heterogeneity. It introduces Anchor Projected PCA (APPCA), a two-stage approach that first recovers robust group-specific subspaces from all observed blocks and then aligns the groups by projecting a shared anchor block onto those subspaces before performing PCA, which denoises the anchor block and yields robust global embeddings. The theoretical contributions include a fine-grained perturbation analysis with a novel spectral-slicing technique, giving reconstruction bounds that do not depend on the conditioning of the subject embeddings, plus chain-link extensions for general missingness patterns. In simulations and on a TEA-seq single-cell dataset, APPCA outperforms shared-block PCA and two-step alignment baselines, particularly under weak or heterogeneous signals. The framework offers a principled route to globally aligned representations from partially observed, heterogeneous data sources, with potential applications in healthcare and genomics.

Abstract

Unified representation learning for multi-source data integration faces two important challenges: blockwise missingness and blockwise signal heterogeneity. The former arises from sources observing different, yet potentially overlapping, feature sets, while the latter involves varying signal strengths across subject groups and feature sets. While existing methods perform well with fully observed data or uniform signal strength, their performance degenerates when these two challenges coincide, which is common in practice. To address this, we propose Anchor Projected Principal Component Analysis (APPCA), a general framework for representation learning with structured blockwise missingness that is robust to signal heterogeneity. APPCA first recovers robust group-specific column spaces using all observed feature sets, and then aligns them by projecting shared "anchor" features onto these subspaces before performing PCA. This projection step induces a significant denoising effect. We establish estimation error bounds for embedding reconstruction through a fine-grained perturbation analysis. In particular, using a novel spectral slicing technique, our bound eliminates the standard dependency on the signal strength of subject embeddings, relying instead solely on the signal strength of integrated feature sets. We validate the proposed method through extensive simulation studies and an application to multimodal single-cell sequencing data.
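The two-stage procedure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, only one plausible reading of the abstract under stated assumptions: two subject groups and three feature blocks, where group 1 observes blocks `b1` and `b2`, group 2 observes blocks `b2` and `b3`, and `b2` serves as the shared anchor. The Gaussian low-rank data model, the noise level, and all names (`appca_sketch`, `sin_theta`, `observe`) are hypothetical choices made for this example.

```python
import numpy as np


def top_left_singular_vectors(X, r):
    """Return the top-r left singular vectors of X (orthonormal columns)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :r]


def appca_sketch(groups, anchor, r):
    """Two-stage anchor-projected PCA (illustrative sketch only).

    groups: list of dicts mapping block name -> (n_g x p_j) observed matrix.
    anchor: name of a block observed by every group.
    Returns an (n_total x r) orthonormal basis for the stacked embeddings.
    """
    projected = []
    for blocks in groups:
        # Stage 1: estimate the group-specific column space from ALL
        # blocks this group observes, concatenated side by side.
        concat = np.hstack([blocks[k] for k in sorted(blocks)])
        U_g = top_left_singular_vectors(concat, r)
        # Stage 2a: project the anchor block onto that subspace.
        projected.append(U_g @ (U_g.T @ blocks[anchor]))
    # Stage 2b: PCA (top-r SVD) on the row-stacked projected anchor blocks.
    return top_left_singular_vectors(np.vstack(projected), r)


def sin_theta(U, V):
    """Spectral sin-theta distance between two orthonormal bases."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.sqrt(max(0.0, 1.0 - s.min() ** 2)))


# Assumed simulation: rank-2 embeddings, Gaussian loadings and noise.
rng = np.random.default_rng(0)
r, n1, n2 = 2, 100, 100
p = {"b1": 200, "b2": 50, "b3": 200}
theta = rng.standard_normal((n1 + n2, r))            # subject embeddings
load = {k: rng.standard_normal((pk, r)) for k, pk in p.items()}


def observe(rows, k, sigma=0.5):
    """Noisy low-rank observation of block k for the given subject rows."""
    signal = theta[rows] @ load[k].T
    return signal + sigma * rng.standard_normal(signal.shape)


g1, g2 = slice(0, n1), slice(n1, n1 + n2)
groups = [
    {"b1": observe(g1, "b1"), "b2": observe(g1, "b2")},  # group 1
    {"b2": observe(g2, "b2"), "b3": observe(g2, "b3")},  # group 2
]

emb = appca_sketch(groups, anchor="b2", r=r)
theta_basis, _ = np.linalg.qr(theta)                 # basis of true col. space
err = sin_theta(emb, theta_basis)
print(f"sin-theta error of the sketch: {err:.3f}")
```

The returned basis identifies the embeddings only up to an invertible r × r transformation, so the sketch compares column spaces rather than entries. Concatenating each group's blocks in stage 1 is what lets stronger non-anchor blocks help estimate the subspace onto which a weaker anchor block is projected.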

Paper Structure

This paper contains 18 sections, 5 theorems, 26 equations, 4 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

With high probability, the Anchor Projected PCA satisfies:

Figures (4)

  • Figure 1: Schematic of the blockwise missingness pattern. Rows correspond to subject groups and columns to feature blocks. Dark regions represent observed data, whereas light regions indicate missing blocks. The area enclosed by the blue dashed line represents the $2 \times 3$ scenario analyzed in Section \ref{subsec:challenge} and the simulation study. The region within the red dotted line illustrates the $3 \times 3$ configuration also evaluated in the simulations.
  • Figure 2: Log-log plot of the estimation error of $\boldsymbol{\Theta}$ with respect to dimension $n=p$.
  • Figure 3: Log-log plots of the estimation error of $\widehat{\boldsymbol{\Theta}}$ under the $2\times3$ setting. Left: fixed-sample-size regime with $n=500$ and increasing feature dimension $p$. Right: fixed-dimension regime with $p=500$ and increasing sample size $n$.
  • Figure 4: Log-log plots of the estimation error of $\widehat{\boldsymbol{\Theta}}$ under the $3\times3$ setting. Left: balanced-growth regime with $n \asymp p$. Middle: fixed-sample-size regime with $n=500$ and increasing $p$. Right: fixed-dimension regime with $p=500$ and increasing $n$.

Theorems & Definitions (9)

  • Theorem (Informal)
  • Theorem 1
  • Remark 1
  • Theorem 2
  • Remark 2: Analysis of Global PCA Error
  • Remark 3: Error Bound Comparison
  • Theorem 3
  • Remark 4
  • Proposition 1