Empirical Bayes Linked Matrix Decomposition

Eric F. Lock

TL;DR

The paper addresses the problem of jointly decomposing multiple bidimensionally linked matrices (e.g., multi-omics data across cohorts) to recover shared, partially shared, and matrix-specific low-rank signals while imputing missing data. It introduces EV-BIDIFAC, an empirical variational Bayes approach that models $\mathbf{X}_{ij} = \sum_k \mathbf{U}_i^{(k)} \mathbf{V}_j^{(k)T} + \mathbf{E}_{ij}$ and optimizes a model-based free-energy objective to obtain shrinkage-driven decompositions with no tuning parameters. The method supports bidimensional integration, comes with a general uniqueness result for the underlying decomposition, and extends to missing data via an EM-like iterative imputation scheme. In simulations and in an application to TCGA breast cancer (BRCA) data, it recovers the underlying low-rank structure, separates shared from specific components, and imputes missing data accurately.
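To make the block structure of the model concrete, the following is a minimal simulation sketch, not the paper's code. It assumes that each module k is active on a chosen subset of row sets and column sets, and contributes nothing outside them. The function name `simulate_linked_blocks`, its parameters, and the Gaussian loadings and scores are illustrative assumptions.

```python
import numpy as np

def simulate_linked_blocks(row_sizes, col_sizes, modules, rank=1, noise_sd=1.0, seed=0):
    """Simulate a grid of linked matrices X[i][j] = sum_k U_i^(k) V_j^(k)^T + E_ij.

    modules is a list of (row_block_indices, col_block_indices) pairs; a module
    only contributes signal to blocks (i, j) with i and j in those index lists
    (hypothetical encoding of shared, partially shared, and specific structure).
    """
    rng = np.random.default_rng(seed)
    n_row_sets, n_col_sets = len(row_sizes), len(col_sizes)
    signal = [[np.zeros((row_sizes[i], col_sizes[j])) for j in range(n_col_sets)]
              for i in range(n_row_sets)]
    for rows_in, cols_in in modules:
        # One loading matrix per active row set, one score matrix per active column set.
        U = {i: rng.normal(size=(row_sizes[i], rank)) for i in rows_in}
        V = {j: rng.normal(size=(col_sizes[j], rank)) for j in cols_in}
        for i in rows_in:
            for j in cols_in:
                signal[i][j] += U[i] @ V[j].T
    # Add independent Gaussian noise E_ij to every block.
    X = [[signal[i][j] + rng.normal(scale=noise_sd, size=signal[i][j].shape)
          for j in range(n_col_sets)] for i in range(n_row_sets)]
    return X, signal

# A 2 x 2 grid with one global module, one row-shared module, and one block-specific module.
modules = [([0, 1], [0, 1]), ([0], [0, 1]), ([1], [0])]
X, signal = simulate_linked_blocks([30, 20], [15, 25], modules, rank=2)
```

Under these assumptions, the first module is shared by all four matrices, the second only by the two matrices in the first row set, and the third is specific to a single matrix.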

Abstract

Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular "omics" technologies may capture different feature sets (e.g., corresponding to rows in a matrix) and/or different sample populations (corresponding to columns). This has motivated a large body of work on integrative matrix factorization approaches that identify and decompose low-dimensional signal that is shared across multiple matrices or specific to a given matrix. We propose an empirical variational Bayesian approach to this problem that has several advantages over existing techniques, including the flexibility to accommodate shared signal over any number of row or column sets (i.e., bidimensional integration), an intuitive model-based objective function that yields appropriate shrinkage for the inferred signals, and a relatively efficient estimation algorithm with no tuning parameters. A general result establishes conditions for the uniqueness of the underlying decomposition for a broad family of methods that includes the proposed approach. For scenarios with missing data, we describe an associated iterative imputation approach that is novel for the single-matrix context and a powerful approach for "blockwise" imputation (in which an entire row or column is missing) in various linked matrix contexts. Extensive simulations show that the method performs very well under different scenarios with respect to recovering underlying low-rank signal, accurately decomposing shared and specific signals, and accurately imputing missing data. The approach is applied to gene expression and miRNA data from breast cancer tissue and normal breast tissue, for which it gives an informative decomposition of variation and outperforms alternative strategies for missing data imputation.
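The iterative imputation idea mentioned in the abstract can be illustrated for a single matrix with a simple EM-like loop: fill the missing entries, fit a low-rank estimate, overwrite only the missing entries with the fit, and repeat. The sketch below is not the paper's estimator. It swaps in a fixed-rank truncated SVD for the empirical variational Bayes fit, and the names `truncated_svd` and `iterative_impute` are hypothetical.

```python
import numpy as np

def truncated_svd(X, rank):
    """Rank-`rank` approximation of X (stand-in for the paper's shrinkage estimator)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :rank] * d[:rank]) @ Vt[:rank]

def iterative_impute(X, rank, n_iter=200, tol=1e-8):
    """Impute NaN entries of X by alternating a low-rank fit with re-filling."""
    missing = np.isnan(X)
    # Start by filling missing entries with the mean of the observed entries.
    filled = np.where(missing, np.nanmean(X), X)
    for _ in range(n_iter):
        fit = truncated_svd(filled, rank)
        # Observed entries are kept; only missing entries take the fitted values.
        updated = np.where(missing, fit, X)
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:
            break
    return filled

# Usage: hide a few entries of a noisy low-rank matrix and impute them.
rng = np.random.default_rng(0)
truth = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
observed = truth + 0.1 * rng.normal(size=truth.shape)
observed[rng.random(truth.shape) < 0.1] = np.nan
imputed = iterative_impute(observed, rank=2)
```

Unlike this stand-in, the approach described in the abstract has no tuning parameters, so the rank passed here is purely an artifact of the simplified fit.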

Paper Structure

This paper contains 16 sections, 14 theorems, 48 equations, 8 figures, 1 table, 2 algorithms.

Introduction
Notation and Setting
Single-matrix results
Bidimensional linked matrix factorization
Empirical variational BIDIFAC (EV-BIDIFAC)
Uniqueness
Missing data imputation
Simulations
Single matrix
Two linked matrices
Bidimensionally linked matrices
Application to BRCA Data
Conclusion and discussion
Proofs
Proof of Theorem \ref{thm:ident}
...and 1 more sections

Key Result

Proposition 1

For $\mathbf{X}: M \times N$, the minimizer of the least squares objective $||\mathbf{X}-\mathbf{S}||_F^2$ under the constraint $\text{rank}(\mathbf{S})=R \leq \min(M,N)$ is given by $\mathbf{S}=\mathbf{U}_\mathbf{X} \mathbf{D}_\mathbf{S} \mathbf{V}_\mathbf{X}^T$ where $\mathbf{D}_\mathbf{S}$ is diagonal with ...
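The statement above is cut off in this listing. The sketch below assumes it takes the standard Eckart-Young form, in which $\mathbf{U}_\mathbf{X}$ and $\mathbf{V}_\mathbf{X}$ are the singular vectors of $\mathbf{X}$ and $\mathbf{D}_\mathbf{S}$ keeps its $R$ largest singular values. The function name `best_rank_r` is illustrative.

```python
import numpy as np

def best_rank_r(X, R):
    """Rank-R least squares approximation of X via its truncated SVD."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)  # singular values in descending order
    D_S = np.diag(d[:R])                              # keep the R largest singular values
    S = U[:, :R] @ D_S @ Vt[:R]
    return S, d

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 7))
S, d = best_rank_r(X, R=3)

# The squared Frobenius error equals the sum of the discarded squared singular values.
err = np.linalg.norm(X - S) ** 2
tail = np.sum(d[3:] ** 2)
```

In this sketch `err` and `tail` agree up to floating point error, which is the usual way to check the result numerically.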

Figures (8)

Figure 1: Error in estimating the underlying low-rank signal for a single matrix under different methods and different signal-to-noise (s2n) ratios. The left panel gives relative squared error, and the right panel gives oracle normalized standard error. All axes are on a log scale.
Figure 2: Error in estimating underlying low-rank structure in which the rank-1 components have heterogeneous signal sizes. The left panel gives relative squared error (RSE), and the right panel gives oracle normalized standard error (ONSE). All axes are on a log scale.
Figure 3: Missing data imputation accuracy for different levels of missingness. The left column gives $\text{RSE}_{\text{miss}}$ and the right column gives $\text{ONSE}_\text{miss}$.
Figure 4: Error in estimating the underlying low-rank signal for two linked matrices under different signal-to-noise (s2n) ratios. The left panel gives the RSE, and the right panel gives the RDSE. All axes are on a log scale.
Figure 5: RSE (left) and RDSE (right) for low-rank structure with heterogeneous signal levels for two linked matrices.
...and 3 more figures

Theorems & Definitions (19)

Proposition 1
Proposition 2
Proposition 3
Corollary 4
Proposition 5
Theorem 6
Theorem 7
Theorem 8
Proposition 9
Theorem 10
...and 9 more
