Spectral decomposition-assisted multi-study factor analysis

Lorenzo Mauri, Niccolò Anceschi, David B. Dunson

TL;DR

BLAST tackles high-dimensional covariance estimation across multiple studies by decomposing each study's covariance into a shared low-rank component $\mathbf{\Lambda}\mathbf{\Lambda}^{\top}$, study-specific low-rank components $\mathbf{\Gamma}_s\mathbf{\Gamma}_s^{\top}$, and diagonal noise $\mathbf{\Sigma}$. It uses a spectral factorization step to identify the shared subspace, followed by surrogate Bayesian regressions for fast, parallelizable inference on loadings and residual variances, avoiding expensive MCMC. Theoretical guarantees include Procrustes consistency for the latent factors, posterior contraction, and a central limit theorem (Bernstein–von Mises type) for the low-rank components, with variance inflation ensuring valid coverage; the framework is robust to heteroscedasticity and can be extended to moment-based assumptions. Empirically, BLAST demonstrates competitive accuracy and well-calibrated uncertainty in simulations and in a gene-expression integration application, while offering substantial computational speedups and scalability for large omics datasets.
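For orientation, the three components named above correspond to the usual multi-study factor model, written out here as a sketch. The symbols $\boldsymbol{\eta}_{si}$, $\boldsymbol{\phi}_{si}$, $\boldsymbol{\epsilon}_{si}$ and the ranks $k$, $q_s$ are notation assumed for illustration and do not appear in this summary; standard normal, mutually independent factors are likewise an assumption.

```latex
% Observation i in study s (p-dimensional), assumed notation:
\mathbf{y}_{si} = \mathbf{\Lambda}\,\boldsymbol{\eta}_{si}
                + \mathbf{\Gamma}_s\,\boldsymbol{\phi}_{si}
                + \boldsymbol{\epsilon}_{si},
\qquad
\boldsymbol{\eta}_{si} \sim N_k(\mathbf{0}, \mathbf{I}_k),\quad
\boldsymbol{\phi}_{si} \sim N_{q_s}(\mathbf{0}, \mathbf{I}_{q_s}),\quad
\boldsymbol{\epsilon}_{si} \sim N_p(\mathbf{0}, \mathbf{\Sigma}).

% Marginalizing over the independent factors gives the study-s covariance:
\operatorname{cov}(\mathbf{y}_{si})
  = \mathbf{\Lambda}\mathbf{\Lambda}^{\top}
  + \mathbf{\Gamma}_s\mathbf{\Gamma}_s^{\top}
  + \mathbf{\Sigma}.
```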

Abstract

This article focuses on covariance estimation for multi-study data. Popular approaches employ factor-analytic terms with shared and study-specific loadings that decompose the variance into (i) a shared low-rank component, (ii) study-specific low-rank components, and (iii) a diagonal term capturing idiosyncratic variability. Our proposed methodology estimates the latent factors via spectral decompositions, with a novel approach for separating shared and specific factors, and infers the factor loadings and residual variances via surrogate Bayesian regressions. The resulting posterior has a simple product form across outcomes, bypassing the need for Markov chain Monte Carlo sampling and facilitating parallelization. The proposed methodology has major advantages over current Bayesian competitors in terms of computational speed, scalability and stability while also having strong frequentist guarantees. The theory and methods also add to the rich literature on frequentist methods for factor models with shared and group-specific components of variation. The approximation error decreases as the sample size and the data dimension diverge, formalizing a blessing of dimensionality. We show favorable asymptotic properties, including central limit theorems for point estimators and posterior contraction, and excellent empirical performance in simulations. The methods are applied to integrate three studies on gene associations among immune cells.
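To make the idea of separating shared from study-specific directions by spectral decomposition concrete, here is a minimal NumPy sketch on simulated data. It is not the authors' algorithm: the projection-averaging rule, the function names, the Gaussian simulation, and the assumption that the numbers of shared factors `k` and study-specific factors `q` are known are all illustrative choices.

```python
import numpy as np

def simulate_studies(rng, n_per_study, p, k, q, noise_sd=1.0):
    """Draw mean-zero data from y = Lambda eta + Gamma_s phi + eps per study."""
    Lam = rng.normal(size=(p, k))              # shared loadings
    studies = []
    for n in n_per_study:
        Gam = rng.normal(size=(p, q))          # study-specific loadings
        eta = rng.normal(size=(n, k))          # shared factors
        phi = rng.normal(size=(n, q))          # study-specific factors
        eps = noise_sd * rng.normal(size=(n, p))
        studies.append(eta @ Lam.T + phi @ Gam.T + eps)
    return studies, Lam

def shared_subspace(studies, k, q):
    """Average per-study projections; directions present in every study
    get eigenvalues near 1, study-specific ones near 1/S."""
    p = studies[0].shape[1]
    mean_proj = np.zeros((p, p))
    for Y in studies:
        # Top right singular vectors approximate span([Lambda, Gamma_s]).
        _, _, Vt = np.linalg.svd(Y, full_matrices=False)
        V = Vt[: k + q].T
        mean_proj += V @ V.T / len(studies)
    evals, evecs = np.linalg.eigh(mean_proj)   # ascending order
    return evecs[:, -k:], evals[::-1]          # top-k basis, descending eigenvalues

def subspace_error(U, A):
    """Spectral norm of the difference between the projections onto
    span(U) and span(A); U is assumed to have orthonormal columns."""
    Q, _ = np.linalg.qr(A)
    return np.linalg.norm(U @ U.T - Q @ Q.T, ord=2)

rng = np.random.default_rng(0)
studies, Lam = simulate_studies(rng, [400, 500, 600], p=200, k=3, q=2)
U_shared, evals = shared_subspace(studies, k=3, q=2)
err = subspace_error(U_shared, Lam)
print(f"subspace error: {err:.3f}")
print("leading eigenvalues of the averaged projection:", np.round(evals[:6], 2))
```

In this toy setting the eigenvalues of the averaged projection show a visible gap after the first `k` values, which is what lets the shared subspace be read off. With real data, the columns would need centering and `k`, `q` would have to be chosen from the data.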

Paper Structure

This paper contains 22 sections, 25 theorems, 174 equations, 2 figures, 12 tables, 2 algorithms.

Key Result

Theorem 1

Suppose Assumptions (model) through (sigma) hold and $n_s = \mathcal{O}(n_{\min}^2)$ for all $s = 1, \dots, S$, where $n_{\min} = \min_{s=1, \dots, S} n_s$. Then, as $n_1, \dots, n_S, p \to \infty$, with probability at least $1-o(1)$,

Figures (2)

  • Figure 1: Reconstructed within-study correlation matrices (left and middle panels) and rescaled shared component (right panel) for 1000 genes. Elements for which the $95\%$ credible intervals included $0$ were set to $0$.
  • Figure 2: Common gene co-expression network among 392 genes, obtained using Gephi. Nodes represent genes and edges represent positive dependencies. Node size is proportional to the node degree. Nodes are divided into four main clusters based on their connections.

Theorems & Definitions (57)

  • Remark 1
  • Remark 2
  • Theorem 1: Recovery of latent factors
  • Remark 3
  • Theorem 2: Consistency and posterior contraction
  • Remark 4
  • Remark 5: Consistency holds under heteroscedasticity
  • Remark 6: Extension to Frobenius loss
  • Theorem 3: Central limit theorem
  • Remark 7
  • ...and 47 more