Block subsampled randomized Hadamard transform for low-rank approximation on distributed architectures

Oleg Balabanov; Matthias Beaupere; Laura Grigori; Victor Lederer

TL;DR

This work introduces a block subsampled randomized Hadamard transform (block SRHT) as a distributed-friendly oblivious subspace embedding (OSE) for low-rank approximation. The sketching matrix $\mathbf{\Omega}$ is assembled blockwise from SRHTs, and the authors prove that it is an $(\varepsilon,\delta,d)$-OSE with a row count comparable to that of the standard SRHT, while it can be applied with little communication on distributed architectures. They analyze randomized low-rank methods (RSVD, Nyström, and single-view approximation) within a common projection-based framework that relies solely on the OSE property, which makes these methods compatible with the block SRHT. Numerical experiments on large SPD matrices and tall-and-skinny problems show that the block SRHT matches Gaussian embeddings in accuracy while running up to about $2.5$ times faster in practical distributed settings, with strong and weak scalability up to thousands of processors. The results indicate that the block SRHT improves practical performance without sacrificing theoretical guarantees, which makes it well suited to large-scale, distributed numerical linear algebra.

Abstract

This article introduces a novel structured random matrix composed blockwise from subsampled randomized Hadamard transforms (SRHTs). The block SRHT is expected to outperform well-known dimension reduction maps, including SRHT and Gaussian matrices, on distributed architectures with not too many cores compared to the dimension. We prove that a block SRHT with enough rows is an oblivious subspace embedding, i.e., an approximate isometry for an arbitrary low-dimensional subspace with high probability. Our estimate of the required number of rows is similar to that of the standard SRHT. This suggests that the two transforms should provide the same accuracy of approximation in the algorithms. The block SRHT can be readily incorporated into randomized methods, for instance to compute a low-rank approximation of a large-scale matrix. For completeness, we revisit some common randomized approaches for this problem such as Randomized Singular Value Decomposition and Nyström approximation, with a discussion of their accuracy and implementation on distributed architectures.
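As a rough illustration of the blockwise construction, the following NumPy sketch applies a block SRHT to a matrix whose rows are split into `p` contiguous blocks, one per simulated processor. The specific form used here (independent random sign flips on both sides of each block, a row sample shared by all blocks, and a $1/\sqrt{l}$ scaling) is an assumption made for illustration, since the abstract does not spell out the construction. The names `fwht` and `block_srht_apply` are hypothetical.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along axis 0.

    The length of axis 0 must be a power of two.
    """
    x = np.array(x, dtype=float, copy=True)
    m = x.shape[0]
    h = 1
    while h < m:
        for i in range(0, m, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b            # butterfly: sums
            x[i + h:i + 2 * h] = a - b    # butterfly: differences
        h *= 2
    return x

def block_srht_apply(V, p, l, rng):
    """Compute Omega @ V for an assumed block SRHT Omega of shape (l, n).

    V is an (n, c) array split into p row blocks of size m = n // p.
    Each block is sketched locally; the results are then summed, which
    would correspond to a reduction across processors.
    """
    n = V.shape[0]
    m = n // p
    if m * p != n or m & (m - 1) != 0:
        raise ValueError("n must equal p * m with m a power of two")
    if l > m:
        raise ValueError("l must not exceed the local block size m")
    rows = rng.choice(m, size=l, replace=False)   # row sample shared by all blocks
    out = np.zeros((l, V.shape[1]))
    for i in range(p):                            # one iteration per simulated processor
        Vi = V[i * m:(i + 1) * m]
        d_right = rng.choice([-1.0, 1.0], size=m) # random signs before the transform
        d_left = rng.choice([-1.0, 1.0], size=l)  # random signs after sampling
        local = fwht(d_right[:, None] * Vi)[rows]
        out += d_left[:, None] * local            # stands in for the reduction step
    return out / np.sqrt(l)                       # scaling so that E||Omega x||^2 = ||x||^2

rng = np.random.default_rng(0)
V = rng.standard_normal((64, 3))
sketch = block_srht_apply(V, p=4, l=8, rng=rng)   # 8 x 3 sketch of a 64 x 3 matrix
```

In this sketch each block needs only its local rows of `V`, and the only cross-block operation is the final sum of `l`-row matrices, which is the property that makes such a transform attractive when the rows of `V` are distributed.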

Paper Structure (12 sections, 6 theorems, 51 equations, 6 figures, 3 algorithms)

Introduction
Block sub-sampled randomized Hadamard transform
Randomized low-rank approximation
  Randomized Singular Value Decomposition
  Nyström approximation
  Single-view approximation of non-psd matrix
Numerical experiments
  Nyström approximation
  Cost of application to tall-and-skinny matrix
Proof of the main theorem
Conclusion
Acknowledgments

Key Result

Theorem 2.1

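Let $0 < \varepsilon < 1$ and $0 < \delta < 1$. Let $\mathbf{\Omega} \in \mathbb{R}^{l \times n}$ be the block SRHT defined by eq:blockSRHT. If $l$ is sufficiently large (the explicit lower bound on $l$ is missing from this extract), then $\mathbf{\Omega}$ is an $(\varepsilon,\delta,d)$-OSE.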
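To make the OSE property concrete, the sketch below measures how far a given matrix $\mathbf{\Omega}$ is from an isometry on a fixed subspace. It assumes one common formulation of an $\varepsilon$-embedding, namely $\big|\|\mathbf{\Omega}\mathbf{x}\|^2 - \|\mathbf{x}\|^2\big| \le \varepsilon \|\mathbf{x}\|^2$ for all $\mathbf{x}$ in the subspace; the paper's Definition 1.1 is not reproduced in this summary and may be phrased equivalently in terms of inner products. The function name `embedding_distortion` is hypothetical, and the Gaussian matrix in the usage lines is only a stand-in for a block SRHT.

```python
import numpy as np

def embedding_distortion(Omega, V):
    """Smallest eps for which Omega is an eps-embedding of range(V).

    V is an (n, d) array with full column rank whose columns span the subspace.
    """
    Q, _ = np.linalg.qr(V)                 # orthonormal basis of the subspace
    B = Omega @ Q                          # sketched basis, shape (l, d)
    eig = np.linalg.eigvalsh(B.T @ B)      # squared singular values of Omega restricted to range(V)
    return float(np.max(np.abs(eig - 1.0)))

# Usage: a scaled Gaussian matrix as a stand-in embedding (assumption).
rng = np.random.default_rng(0)
n, d, l = 1024, 5, 200
V = rng.standard_normal((n, d))
Omega = rng.standard_normal((l, n)) / np.sqrt(l)
eps_hat = embedding_distortion(Omega, V)   # observed distortion for this single draw
```

An $(\varepsilon,\delta,d)$-OSE guarantee then says that, for any fixed $d$-dimensional subspace, a fresh draw of $\mathbf{\Omega}$ yields a distortion of at most $\varepsilon$ with probability at least $1-\delta$.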

Figures (6)

Figure 1: Trace error $\|\mathbf{A}-[\![\mathbf{A}]\!]^{\mathrm{(Nyst)}}_k\|_{{*}} / \|\mathbf{A}\|_{{*}}$ using BSRHT.
Figure 2: Runtimes of computing $\mathbf{Y} = \mathbf{A} \mathbf{\Omega}^\mathrm{T}$ and $\mathbf{\Omega} \mathbf{Y}$ in algorithm algo_proj_2d for different sampling sizes.
Figure 3: Strong scalability runtimes associated with computing $\mathbf{\Omega} \mathbf{V}$ with $n = 10^7$ and $l = 2000$, versus $p$. "Gauss. total" and "BSRHT total" correspond to the overall runtimes, whereas "Gauss. local" and "BSRHT local" stand for the max per-processor runtimes taken by local multiplications.
Figure 4: Strong scalability runtimes associated with computing $\mathbf{\Omega} \mathbf{V}$ with $n = 10^8$ and $l = 2000$, versus $p$. "Gauss. total" and "BSRHT total" correspond to the overall runtimes, whereas "Gauss. local" and "BSRHT local" stand for the max per-processor runtimes taken by local multiplications.
Figure 5: Max per-processor memory needed for computing $\mathbf{\Omega} \mathbf{V}$ with $n = 10^8$ and $l = 2000$, versus $p$.
...and 1 more figure
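The quantities named in the captions of Figures 1 and 2 can be sketched as follows: the two products $\mathbf{Y} = \mathbf{A}\mathbf{\Omega}^\mathrm{T}$ and $\mathbf{\Omega}\mathbf{Y}$ feed a rank-$k$ Nyström approximation, whose quality is measured by the relative nuclear-norm (trace) error. This is a minimal single-node sketch under stated assumptions: it follows the standard Nyström formula $\mathbf{Y}(\mathbf{\Omega}\mathbf{Y})^{\dagger}\mathbf{Y}^\mathrm{T}$ rather than the paper's distributed algorithm, uses a Gaussian $\mathbf{\Omega}$ in place of the block SRHT, and the names `nystrom_rank_k` and `relative_trace_error` are hypothetical.

```python
import numpy as np

def nystrom_rank_k(A, Omega, k, tol=1e-10):
    """Rank-k Nystrom approximation of a symmetric PSD matrix A.

    Returns (U, lam) such that A is approximated by U @ diag(lam) @ U.T.
    """
    Y = A @ Omega.T                          # n x l sketch
    core = Omega @ Y                         # l x l core matrix
    core = (core + core.T) / 2               # symmetrize against round-off
    w, E = np.linalg.eigh(core)
    keep = w > tol * w.max()                 # drop numerically zero directions
    F = Y @ (E[:, keep] / np.sqrt(w[keep]))  # F @ F.T equals Y @ pinv(core) @ Y.T
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    k = min(k, s.size)
    return U[:, :k], s[:k] ** 2              # best rank-k part of F @ F.T

def relative_trace_error(A, U, lam):
    """Relative nuclear-norm error of the approximation U diag(lam) U^T."""
    approx = (U * lam) @ U.T
    return np.linalg.norm(A - approx, "nuc") / np.linalg.norm(A, "nuc")

# Usage on a small synthetic PSD matrix with a decaying spectrum (assumption).
rng = np.random.default_rng(0)
n, l, k = 200, 40, 10
G, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (G * (1.0 / np.arange(1, n + 1) ** 2)) @ G.T
Omega = rng.standard_normal((l, n)) / np.sqrt(l)
U, lam = nystrom_rank_k(A, Omega, k)
err = relative_trace_error(A, U, lam)
```

Because the routine touches $\mathbf{A}$ only through the product $\mathbf{A}\mathbf{\Omega}^\mathrm{T}$, any other embedding, including a block SRHT, could be substituted for the Gaussian matrix without changing the rest of the code.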

Theorems & Definitions (12)

Definition 1.1
Theorem 2.1: Main Theorem
Remark 3.1
Lemma 5.1
proof
Proposition 5.2
proof
Proposition 5.3
proof
Proposition 5.4: Corollary of Theorem 2.2 in tropp2011improved
...and 2 more
