Fast randomized algorithms for low-rank matrix approximations with applications in global comparative analysis of a class of data sets

Weiwei Xu; Weijie Shen; Wen Li; Weiguo Gao; Yingzhou Li

TL;DR

The paper introduces a randomized two-phase algorithm to compute generalized singular values for a pair of low-rank data matrices $G_1$ and $G_2$, enabling fast and accurate GSVD-based comparative analysis. By constructing orthonormal bases via randomized projections and performing GSVD on compressed matrices, it achieves substantial runtime savings while preserving accuracy in angular distances, generalized fractions, and entropy measures. The authors provide perturbation bounds linking basis approximation errors to errors in GSVD-derived quantities and validate the approach on synthetic data and real genome-scale expression datasets, including yeast, human cell-cycle, and mice macrophage data. The work offers a practical, scalable tool for genome-scale comparative analyses and highlights the method’s robustness to missing data recovery strategies.

Abstract

Generalized singular values (GSVs) play an essential role in comparative analysis. In real-world data for comparative analysis, both data matrices are usually numerically low-rank. This paper proposes a randomized algorithm that first approximately extracts bases and then calculates GSVs efficiently. The accuracy of both the basis extraction and the comparative analysis quantities (angular distances, generalized fractions of the eigenexpression, and generalized normalized Shannon entropy) is rigorously analyzed. The proposed algorithm is applied to both synthetic data sets and genome-scale expression data sets. Compared with other GSV algorithms, the proposed algorithm achieves the fastest runtime while preserving sufficient accuracy in comparative analysis.
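
The two-phase idea described in the abstract (extract bases, then compute GSVs on compressed matrices) can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: the function names, the fixed target ranks `k1` and `k2`, the oversampling parameter, and the use of a stacked SVD to obtain the cosine-sine pairs $(c_i, s_i)$ are all assumptions made for illustration, and real matrices are used, so the conjugate transpose becomes a plain transpose.

```python
import numpy as np

def randomized_basis(G, k, oversample=5, seed=0):
    """Phase 1 (assumed form): orthonormal Q with range(Q) ~ range(G)."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((G.shape[1], k + oversample))  # Gaussian test matrix
    Q, _ = np.linalg.qr(G @ Omega)  # reduced QR of the sketch G @ Omega
    return Q

def gsv_pairs(A, B, rtol=1e-10):
    """Cosine-sine pairs (c_i, s_i) of the pair (A, B), sorted by decreasing c_i.

    The generalized singular values are the ratios c_i / s_i.
    s_i is recovered as sqrt(1 - c_i^2), which is inaccurate when c_i is close to 1.
    """
    U, sv, _ = np.linalg.svd(np.vstack([A, B]), full_matrices=False)
    r = int(np.sum(sv > rtol * sv[0]))         # numerical rank of the stacked matrix
    U1 = U[:A.shape[0], :r]                    # rows of U that belong to A
    cs = np.linalg.svd(U1, compute_uv=False)   # singular values of U1 are the cosines
    c = np.zeros(r)                            # pad with zeros if A has fewer than r rows
    c[:cs.size] = np.clip(cs, 0.0, 1.0)
    return c, np.sqrt(1.0 - c**2)

def randomized_gsv(G1, G2, k1, k2):
    """Phase 2 (assumed form): GSV pairs of the compressed matrices Q_i^T G_i."""
    Q1 = randomized_basis(G1, k1, seed=1)
    Q2 = randomized_basis(G2, k2, seed=2)
    return gsv_pairs(Q1.T @ G1, Q2.T @ G2)

# Synthetic exactly rank-20 matrices with a shared column dimension n = 30.
rng = np.random.default_rng(0)
G1 = rng.standard_normal((130, 20)) @ rng.standard_normal((20, 30))
G2 = rng.standard_normal((35, 20)) @ rng.standard_normal((20, 30))

c_full, s_full = gsv_pairs(G1, G2)              # reference on the full matrices
c_rand, s_rand = randomized_gsv(G1, G2, 20, 20) # same quantities after compression
max_diff = np.max(np.abs(c_full - c_rand))
```

Because $Q_i$ has orthonormal columns, $(Q_i^{\mathrm{H}} G_i)^{\mathrm{H}} (Q_i^{\mathrm{H}} G_i) = G_i^{\mathrm{H}} Q_i Q_i^{\mathrm{H}} G_i$, so the compressed pair reproduces the GSVs of the original pair whenever $Q_i Q_i^{\mathrm{H}} G_i \approx G_i$. In this sketch `max_diff` measures how closely the cosines agree.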

Paper Structure (9 sections, 8 theorems, 50 equations, 5 figures, 5 tables, 2 algorithms)

Introduction
Randomized algorithms for low-rank matrix approximations for GSVs
Comparative analysis of a class of genome-scale expression data sets
Numerical experiments
Synthetic data sets
Genome-scale expression data sets
Yeast and human cell-cycle expression data set
Mice macrophage gene expression data set
Conclusion

Key Result

Lemma 3.1

Let $Q_1$, $Q_2$ and $\epsilon$ be given by Algorithm \ref{alg:basis-ext}. Then in step 7 there exists an $i$ such that $\left\| Q_1 Q_1^{\mathrm{H}} G_1 - G_1 \right\|_{\mathrm{F}} < \epsilon$, and there exists a $j$ such that $\left\| Q_2 Q_2^{\mathrm{H}} G_2 - G_2 \right\|_{\mathrm{F}} < \epsilon$.
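
The stopping criterion in Lemma 3.1 can be illustrated with a blocked, adaptive range finder that enlarges the basis until the Frobenius-norm residual drops below $\epsilon$. The sketch below is an assumption about the general shape of such a loop, not a transcription of the paper's basis-extraction algorithm; the function name `adaptive_basis`, the block size, and the full re-orthonormalization in each pass are illustrative choices, and real matrices are used.

```python
import numpy as np

def adaptive_basis(G, eps, block=5, seed=0):
    """Grow an orthonormal basis Q until ||Q Q^T G - G||_F < eps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m, n = G.shape
    Q = np.zeros((m, 0))
    R = G.copy()  # current residual G - Q Q^T G
    while np.linalg.norm(R, "fro") >= eps and Q.shape[1] < min(m, n):
        Y = R @ rng.standard_normal((n, block))   # sample the residual's range
        Q, _ = np.linalg.qr(np.hstack([Q, Y]))    # re-orthonormalize old and new directions
        R = G - Q @ (Q.T @ G)                     # update the residual
    return Q

# Exactly rank-20 test matrix: four blocks of 5 columns should suffice.
rng = np.random.default_rng(0)
G = rng.standard_normal((130, 20)) @ rng.standard_normal((20, 30))
eps = 1e-8
Q = adaptive_basis(G, eps)
residual = np.linalg.norm(Q @ (Q.T @ G) - G, "fro")
```

Recomputing the QR factorization of `[Q, Y]` in every pass is more expensive than orthogonalizing only the new block, but it keeps the columns of `Q` orthonormal even if a sampled block happens to be rank-deficient.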

Figures (5)

Figure 4.1: Runtime (seconds) and absolute errors for three cases of $(m, p, n)$: (a1) $m = n + 100$ and $p = n + 5$; (a2) $m = n + 100$ and $p = n - 5$; (a3) $m = n - 100$ and $p = n - 5$.
Figure 4.2: Relation between $\left\| G_i - Q_i Q_i^{\mathrm{H}} G_i \right\|_{\mathrm{F}}$ and $\left\| \Sigma_i^\star - \Sigma_i \right\|_{\mathrm{F}}$, for $i = 1, 2$.
Figure 4.3: $P_{i,\nu}$, $D_i$ and $\vartheta_{\nu}$ computed by Algorithm \ref{alg:rand-gsv} for the yeast and human cell-cycle expression data set with SVD interpolation.
Figure 4.4: $P_{i,\nu}$, $D_i$ and $\vartheta_{\nu}$ computed by Algorithm \ref{alg:rand-gsv} for the yeast and human cell-cycle expression data set with spline interpolation.
Figure 4.5: $P_{i,\nu}$, $D_i$ and $\vartheta_{\nu}$ computed by Algorithm \ref{alg:rand-gsv} for the mice macrophage gene expression data set.
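
The quantities plotted in Figures 4.3 to 4.5 can be computed from the generalized singular value pairs once those are available. The sketch below assumes the definitions commonly used in GSVD-based comparative analysis: angular distance $\vartheta_{\nu} = \arctan(\sigma_{1,\nu}/\sigma_{2,\nu}) - \pi/4$, generalized fraction $P_{i,\nu} = \sigma_{i,\nu}^2 / \sum_{k} \sigma_{i,k}^2$, and generalized normalized Shannon entropy $D_i = -\frac{1}{\log N} \sum_{k} P_{i,k} \log P_{i,k}$. The paper's exact normalizations may differ, and the function name and the inputs `sigma1`, `sigma2` are hypothetical.

```python
import numpy as np

def comparative_quantities(sigma1, sigma2):
    """Angular distances, generalized fractions, and normalized entropies (assumed definitions)."""
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    # arctan2 handles sigma2 == 0 without dividing; result lies in [-pi/4, pi/4]
    theta = np.arctan2(sigma1, sigma2) - np.pi / 4

    def fractions(s):
        return s**2 / np.sum(s**2)

    def entropy(P):
        nz = P[P > 0]  # convention: 0 * log(0) = 0
        return float(-np.sum(nz * np.log(nz)) / np.log(P.size))

    P1, P2 = fractions(sigma1), fractions(sigma2)
    return theta, P1, P2, entropy(P1), entropy(P2)

# Equal significance in both data sets: zero angular distance, maximal entropy.
theta_eq, P1_eq, P2_eq, D1_eq, D2_eq = comparative_quantities([1, 1, 1, 1], [1, 1, 1, 1])

# Each component exclusive to one data set: angular distances of +pi/4 and -pi/4, zero entropy.
theta_ex, P1_ex, P2_ex, D1_ex, D2_ex = comparative_quantities([1, 0], [0, 1])
```

Under these definitions an angular distance near $\pi/4$ marks a component significant only in the first data set, a value near $-\pi/4$ marks one significant only in the second, and a value near zero marks a component of equal significance in both.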

Theorems & Definitions (11)

Lemma 3.1
proof
Proposition 3.2 (Proposition 10.1 of [hmt11])
Proposition 3.3 (Proposition 10.2 of [hmt11])
Theorem 3.4 (Theorem 3.3.16 of [hj91])
Theorem 3.5
proof
Lemma 3.6
Corollary 3.7
Theorem 3.8
...and 1 more
