Low-Bandwidth Matrix Multiplication: Faster Algorithms and More General Forms of Sparsity

Chetan Gupta; Janne H. Korhonen; Jan Studený; Jukka Suomela; Hossein Vahidi

TL;DR

This work advances distributed sparse matrix multiplication in the low-bandwidth model by delivering faster algorithms and broadening the sparsity notions considered. The authors introduce a refined triangle-processing approach that reduces rounds to $O(d^{1.867})$ for semirings and $O(d^{1.832})$ for fields, leveraging a novel few-triangles technique and balanced virtual instances. They extend the analysis beyond uniform sparsity to $\textsf{US}$, $\textsf{RS}$, $\textsf{CS}$, $\textsf{BD}$, $\textsf{AS}$, and general matrices, providing a near-complete complexity classification across combinations and achieving $O(d^2 + \log n)$ bounds in many cases. The bounded-degeneracy perspective highlights a practical sparsity model that bridges dense and sparse regimes, with $\textsf{BD}$ subsuming $\textsf{RS}$ and $\textsf{CS}$ and enabling efficient algorithms under structure knowledge. Lower bounds via broadcast, routing, and communication complexity confirm fundamental limits and guide future improvements, including removing support assumptions and tightening gaps in the sparsity-class table.

Abstract

In prior work, Gupta et al. (SPAA 2022) presented a distributed algorithm for multiplying sparse $n \times n$ matrices, using $n$ computers. They assumed that the input matrices are uniformly sparse--there are at most $d$ non-zeros in each row and column--and the task is to compute a uniformly sparse part of the product matrix. The sparsity structure is globally known in advance (this is the supported setting). As input, each computer receives one row of each input matrix, and each computer needs to output one row of the product matrix. In each communication round each computer can send and receive one $O(\log n)$-bit message. Their algorithm solves this task in $O(d^{1.907})$ rounds, while the trivial bound is $O(d^2)$. We improve on the prior work in two dimensions: First, we show that we can solve the same task faster, in only $O(d^{1.832})$ rounds. Second, we explore what happens when matrices are not uniformly sparse. We consider the following alternative notions of sparsity: row-sparse matrices (at most $d$ non-zeros per row), column-sparse matrices, matrices with bounded degeneracy (we can recursively delete a row or column with at most $d$ non-zeros), average-sparse matrices (at most $dn$ non-zeros in total), and general matrices.
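The sparsity notions listed in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the paper's algorithm: it represents a matrix by the set of positions of its non-zeros and computes the per-row maximum, the per-column maximum, the total count, and the degeneracy by greedily deleting the row or column with the fewest remaining non-zeros. The function names `sparsity_profile` and `degeneracy` and the example matrix are hypothetical choices made for this sketch.

```python
from collections import defaultdict


def sparsity_profile(nonzeros):
    """Return (max non-zeros per row, max non-zeros per column, total non-zeros)."""
    rows = defaultdict(int)
    cols = defaultdict(int)
    for i, j in nonzeros:
        rows[i] += 1
        cols[j] += 1
    return (max(rows.values(), default=0),
            max(cols.values(), default=0),
            len(nonzeros))


def degeneracy(nonzeros):
    """Smallest d such that repeatedly deleting a row or column with at most
    d remaining non-zeros eventually removes every non-zero (greedy peeling)."""
    row_items = defaultdict(set)  # row index -> columns still holding a non-zero
    col_items = defaultdict(set)  # column index -> rows still holding a non-zero
    for i, j in nonzeros:
        row_items[i].add(j)
        col_items[j].add(i)

    d = 0
    while row_items or col_items:
        # Always delete the line (row or column) with the fewest remaining non-zeros.
        candidates = [("row", i, len(s)) for i, s in row_items.items()]
        candidates += [("col", j, len(s)) for j, s in col_items.items()]
        kind, idx, size = min(candidates, key=lambda t: t[2])
        d = max(d, size)
        if kind == "row":
            for j in row_items.pop(idx):
                col_items[j].discard(idx)
                if not col_items[j]:
                    del col_items[j]
        else:
            for i in col_items.pop(idx):
                row_items[i].discard(idx)
                if not row_items[i]:
                    del row_items[i]
    return d


# Hypothetical example: a 4 x 4 matrix with one full row and one full column.
n = 4
cross = {(0, j) for j in range(n)} | {(i, 0) for i in range(n)}
profile = sparsity_profile(cross)
cross_degeneracy = degeneracy(cross)
```

In this example the matrix is neither row-sparse nor column-sparse for any $d < n$, because row 0 and column 0 are full, yet its degeneracy is 1: every other row and column holds a single non-zero and can be deleted first. Conversely, a matrix with at most $d$ non-zeros per row can be emptied by deleting rows one at a time, so its degeneracy is at most $d$.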

Paper Structure (32 sections, 30 theorems, 4 equations, 1 figure, 4 tables)

Introduction
Setting and prior work
Contribution 1: faster algorithm
Contribution 2: beyond uniform sparsity
Key conceptual message: role of bounded degeneracy
Related work and applications
Open questions for future work
Preliminaries
Supported model and indicator matrices
Tripartite graph and triangles
Clusters and clusterings
Handling few triangles fast
High-level plan
Virtual instance
Routing
...and 17 more sections

Key Result

Lemma 1

A clustered instance of matrix multiplication can be solved in $O(d^{4/3})$ rounds over semirings, and in $O(d^{1.156671})$ rounds over fields.
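For a sense of scale, the two exponents in Lemma 1 can be compared numerically with the trivial exponent 2 quoted in the abstract. The sketch below also checks one possible reading of the field exponent, namely $2 - 2/\omega$ with a matrix multiplication exponent bound of $\omega < 2.371552$. This summary does not state where the constant comes from, so both the formula and the value of `omega` are assumptions of the sketch.

```python
# Exponents from Lemma 1 and the trivial bound mentioned in the abstract.
semiring_exponent = 4 / 3
field_exponent = 1.156671
trivial_exponent = 2.0

# ASSUMPTION: the field exponent equals 2 - 2/omega for omega = 2.371552.
# Neither the formula nor this value of omega appears in the summary above.
omega = 2.371552
assumed_field_exponent = 2 - 2 / omega

# Illustrative round counts (ignoring constant factors) for a hypothetical d.
d = 1000
rounds = {
    "trivial": d ** trivial_exponent,
    "clustered, semiring": d ** semiring_exponent,
    "clustered, field": d ** field_exponent,
}
```

Under that assumption, `assumed_field_exponent` agrees with the stated 1.156671 to within rounding.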

Figures (1)

Figure 1: Routing scheme from the proof of the few-triangles lemma

Theorems & Definitions (33)

Lemma 1
Lemma 2
Theorem 3 (Gupta et al., SPAA 2022)
Theorem 4
Lemma 5
Corollary 6
Corollary 7
Lemma 8
Lemma 9
Lemma 10
...and 23 more
