Communication Lower Bounds and Algorithms for Sketching with Random Dense Matrices

Hussam Al Daas, Grey Ballard, Laura Grigori, Md Taufique Hussain, Suraj Kumar, Mohammad Marufur Rahman, Kathryn Rouse

Abstract

Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large matrices often requires distributed-memory algorithms, where communication overhead becomes a critical bottleneck on modern supercomputing clusters. Despite its growing relevance, distributed-memory parallel strategies for sketching remain largely unexplored. In this work, we establish communication lower bounds for sketching with dense matrices; these bounds determine how much data movement is required to perform the computation in parallel. One important consequence of our lower bounds is that no communication is required when the number of processors is small. We show that our lower bounds are tight by presenting communication-optimal algorithms. Furthermore, we extend our approach to determine communication lower bounds for computing the Nyström approximation, where sketching is applied twice. We also introduce novel parallel algorithms whose communication costs are close to the lower bounds. Finally, we implement our algorithms on state-of-the-art supercomputing infrastructure that includes both CPU-only and GPU-equipped systems and demonstrate their parallel scalability.
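
As a rough illustration of the computations the abstract refers to, the NumPy sketch below forms the sketch ${\bm{\mathbf{B}}} = {\bm{\mathbf{A}}}{\bm{\mathbf{\Omega}}}$ and the second product ${\bm{\mathbf{C}}} = {\bm{\mathbf{\Omega}}}^{\sf T}{\bm{\mathbf{B}}}$ used in a Nyström approximation. It also simulates a 1D row-block distribution in which every processor regenerates ${\bm{\mathbf{\Omega}}}$ from a shared seed, which is one way a sketch can be computed without moving any data. The function names, the Gaussian choice of ${\bm{\mathbf{\Omega}}}$, and the shared-seed scheme are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np


def local_sketch(A_rows, n, r, seed):
    """Sketch one row block of A without receiving any data.

    Assumption: each processor regenerates the same n x r Gaussian
    Omega from a shared seed instead of having it communicated.
    """
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, r))
    return A_rows @ Omega


def simulate_1d_sketch(A, r, P, seed=0):
    """Simulate P processors that each own a contiguous row block of A."""
    n = A.shape[1]
    row_blocks = np.array_split(A, P, axis=0)
    # Each simulated processor works independently; the final stacking
    # only reassembles the distributed result so it can be checked.
    return np.vstack([local_sketch(blk, n, r, seed) for blk in row_blocks])


def nystrom_factors(A, r, seed=0):
    """Return B = A @ Omega and C = Omega.T @ B (sketching applied twice)."""
    n = A.shape[1]
    Omega = np.random.default_rng(seed).standard_normal((n, r))
    B = A @ Omega
    C = Omega.T @ B
    return B, C
```

In a real distributed-memory run each rank would hold only its own row block of ${\bm{\mathbf{A}}}$ and the matching rows of ${\bm{\mathbf{B}}}$; the single-process simulation here is only meant to show that the row blocks of ${\bm{\mathbf{B}}}$ can be computed independently of one another.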

Paper Structure (31 sections, 9 theorems, 7 equations, 8 figures, 2 tables, 2 algorithms)

Key Result

Lemma 1

Consider any positive integers $\ell$ and $m$ and any $m$ projections $\phi_j:\mathbb{Z}^\ell\rightarrow\mathbb{Z}^{\ell_j}$ ($\ell_j\leq \ell$), each of which extracts $\ell_j$ coordinates $S_j\subseteq [\ell]$ and forgets the $\ell-\ell_j$ others. Define $\mathcal{C} = \{{\bm{\mathbf{s}}} \in[0,1]

Figures (8)

  • Figure 1: Iteration space of ${\bm{\mathbf{B}}} = {\bm{\mathbf{A}}}{\bm{\mathbf{\Omega}}}$ and ${\bm{\mathbf{\Omega}}}^{\sf T}{\bm{\mathbf{B}}}={\bm{\mathbf{C}}}$ computation with a total of $n(n+r)r$ iteration points. The faces show the accesses to different matrices and the shading corresponds to distribution of the computation across 3 processors. The prism on the left depicts the algorithm with $p_1=q_3=P$ (1D algorithms with Redistribution of ${\bm{\mathbf{B}}}$), and the prism on the right depicts the algorithm with $p_1=q_1=P$ (1D algorithms with No-Redistribution of ${\bm{\mathbf{B}}}$).
  • Figure 2: Comparison of 3D distributed-memory matrix multiplication in C++ and Python using CPUs only. The experiment multiplies two $50k \times 50k$ double-precision matrices using a processor grid that is as close to cubic as possible.
  • Figure 3: Comparison between generating ${\bm{\mathbf{\Omega}}}$ redundantly vs communicating ${\bm{\mathbf{\Omega}}}$ in CPU-only systems for different values of $r$ to compute ${\bm{\mathbf{B}}}={\bm{\mathbf{A}}}{\bm{\mathbf{\Omega}}}$ where ${\bm{\mathbf{A}}}$ is a CIFAR10 kernel matrix with dimensions $50k \times 50k$.
  • Figure 4: Total runtime of multiplying a $10^6 \times 10^6$ genetic dissimilarity matrix by a random matrix with $10^3$ columns.
  • Figure 5: Total runtime of Nyström computation with the Redist and No-Redist algorithms on both CPU-only and GPU-equipped systems when approximating the CIFAR10 kernel matrix to different ranks.
  • ...and 3 more figures
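
The iteration-point total quoted in the Figure 1 caption can be checked by counting the scalar multiply-add operations in each product. The count below assumes that ${\bm{\mathbf{A}}}$ is $n \times n$ and ${\bm{\mathbf{\Omega}}}$ is $n \times r$; these dimensions are not stated in this excerpt, but they are consistent with the stated total.

$$
\underbrace{n \cdot n \cdot r}_{{\bm{\mathbf{B}}} = {\bm{\mathbf{A}}}{\bm{\mathbf{\Omega}}}}
\;+\;
\underbrace{r \cdot n \cdot r}_{{\bm{\mathbf{C}}} = {\bm{\mathbf{\Omega}}}^{\sf T}{\bm{\mathbf{B}}}}
\;=\; n^2 r + n r^2
\;=\; n(n+r)r .
$$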

Theorems & Definitions (10)

  • Lemma 1
  • Lemma 2: ABGKR22
  • Definition 1: BV04
  • Lemma 3: ABGKR22
  • Lemma 4: BR20
  • Theorem 1
  • Lemma 5
  • Theorem 2
  • Lemma 6
  • Theorem 3