Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations

Hussam Al Daas; Grey Ballard; Laura Grigori; Suraj Kumar; Kathryn Rouse; Mathieu Verite

TL;DR

The paper develops tight, geometry-grounded lower bounds on data movement for the symmetric BLAS kernels SYRK, SYR2K, and SYMM in both sequential and distributed-memory models. It extends the symmetric Loomis–Whitney inequality to relate the 3D iteration space of symmetric three-nested-loop (3NL) computations to 2D data accesses, then solves constrained optimization problems to obtain memory-dependent and memory-independent bounds. To match these bounds, it introduces triangle-block partitions of the lower triangle of the symmetric matrix, which connect to balanced clique partitions and Steiner systems via affine and projective geometric constructions. It then presents communication-optimal 1D, 2D, and 3D parallel algorithms (and limited-memory variants) and provides detailed analyses of their memory, bandwidth, and computational costs, showing that they match the lower bounds in the leading constant. The results generalize prior symmetric lower-bound approaches and offer a principled design framework for symmetry-exploiting matrix computations, with practical implications for high-performance BLAS implementations.

Abstract

In this article, we focus on the communication costs of three symmetric matrix computations: (i) multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK); (ii) adding the result of the multiplication of a matrix with the transpose of another matrix and the transpose of that result, known as a symmetric rank-2k update (SYR2K); and (iii) performing matrix multiplication with a symmetric input matrix (SYMM). All three computations appear in the Level 3 Basic Linear Algebra Subprograms (BLAS) and have wide use in applications involving symmetric matrices. We establish communication lower bounds for these kernels using sequential and distributed-memory parallel computational models, and we show that our bounds are tight by presenting communication-optimal algorithms for each setting. Our lower bound proofs rely on applying a geometric inequality for symmetric computations and analytically solving constrained nonlinear optimization problems. The symmetric matrix and its corresponding computations are accessed and performed according to a triangular block partitioning scheme in the optimal algorithms.
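As a rough illustration of the three kernels named in the abstract, the following pure-Python sketch computes each one while touching only the lower triangle of the symmetric matrix. The function names and the list-of-rows layout are assumptions made for this example; they are not the paper's algorithms, which concern how data is partitioned and moved rather than the arithmetic itself.

```python
def syrk_lower(A):
    """Lower triangle of C = A * A^T, with A given as a list of n1 rows of length n2."""
    n1, n2 = len(A), len(A[0])
    C = [[0] * n1 for _ in range(n1)]
    for i in range(n1):
        for j in range(i + 1):  # only j <= i: the upper triangle follows by symmetry
            C[i][j] = sum(A[i][k] * A[j][k] for k in range(n2))
    return C


def syr2k_lower(A, B):
    """Lower triangle of C = A * B^T + B * A^T for A and B of the same shape."""
    n1, n2 = len(A), len(A[0])
    C = [[0] * n1 for _ in range(n1)]
    for i in range(n1):
        for j in range(i + 1):
            C[i][j] = sum(A[i][k] * B[j][k] + B[i][k] * A[j][k] for k in range(n2))
    return C


def symm(A_lower, B):
    """C = A * B, where the symmetric matrix A is read only through its lower triangle."""
    n1, n2 = len(A_lower), len(B[0])
    C = [[0] * n2 for _ in range(n1)]
    for i in range(n1):
        for j in range(i + 1):
            a = A_lower[i][j]
            for k in range(n2):
                C[i][k] += a * B[j][k]      # contribution of A[i][j]
                if i != j:
                    C[j][k] += a * B[i][k]  # contribution of A[j][i] = A[i][j]
    return C


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0], [0, 1], [1, 1]]
C_syrk = syrk_lower(A)
C_syr2k = syr2k_lower(A, B)
C_symm = symm([[2, 0], [1, 3]], [[1, 2], [3, 4]])
```

In each loop nest, one stored entry of the symmetric matrix serves two positions of the full matrix, which is the reuse that symmetry-exploiting algorithms try to preserve when partitioning the work.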

Paper Structure (62 sections, 25 theorems, 62 equations, 6 figures, 4 tables, 18 algorithms)

Introduction
Related Work
Preliminaries
Symmetric Atomic Three Nested Loop Algorithms
Computation Models
Sequential Computation Model
Parallel Computation Model
Collective Communication Costs
Fundamental Results
Memory Dependent Lower Bound Results
Key Optimization Problem
Sequential Lower Bounds
Parallel Lower Bounds
Memory Independent Parallel Lower Bound Results
Key Optimization Problem
...and 47 more sections

Key Result

Lemma 1

Let $V$ be a finite set of points in $\mathbb{Z}^3$. Let $\phi_i(V)$ be the projection of $V$ in the $i$-direction, i.e. all points $(j,k)$ such that there exists an $i$ so that $(i,j,k) \in V$. Define $\phi_j(V)$ and $\phi_k(V)$ similarly. Then
$$|V| \le |\phi_i(V)|^{1/2}\,|\phi_j(V)|^{1/2}\,|\phi_k(V)|^{1/2},$$
where $|\cdot|$ denotes the cardinality of a set.
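The following sketch checks the Loomis-Whitney inequality in its standard three-dimensional form, $|V|^2 \le |\phi_i(V)|\,|\phi_j(V)|\,|\phi_k(V)|$, on two small point sets. The helper names and the example sets are assumptions chosen for illustration. The comparison uses squared quantities so that it stays in exact integer arithmetic.

```python
def projections(V):
    """Return the three coordinate projections of a set of points (i, j, k)."""
    phi_i = {(j, k) for (i, j, k) in V}  # drop the i coordinate
    phi_j = {(i, k) for (i, j, k) in V}  # drop the j coordinate
    phi_k = {(i, j) for (i, j, k) in V}  # drop the k coordinate
    return phi_i, phi_j, phi_k


def projection_product(V):
    """|phi_i(V)| * |phi_j(V)| * |phi_k(V)| for a collection of points V."""
    phi_i, phi_j, phi_k = projections(set(V))
    return len(phi_i) * len(phi_j) * len(phi_k)


def loomis_whitney_holds(V):
    """Check |V|^2 <= |phi_i(V)| * |phi_j(V)| * |phi_k(V)| using integers only."""
    return len(set(V)) ** 2 <= projection_product(V)


# A 3 x 3 x 3 cube of iteration points.
cube = [(i, j, k) for i in range(3) for j in range(3) for k in range(3)]

# A prism over the strict lower triangle i > j, the shape of iteration space
# that arises when only one triangle of a symmetric matrix is computed.
prism = [(i, j, k) for i in range(4) for j in range(i) for k in range(3)]

for name, V in (("cube", cube), ("prism", prism)):
    print(name, len(V) ** 2, projection_product(V), loomis_whitney_holds(V))
```

For the cube the two sides coincide, while for the triangular prism the inequality holds with room to spare, which hints at why a bound tailored to symmetric iteration spaces can be sharper.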

Figures (6)

Figure 1: Triangle block partition for $n_1=16$ and $|R_k|=4$. Triangle blocks for $R_{3}$ and $R_{17}$ are highlighted to illustrate both non-contiguous and contiguous triangle blocks.
Figure 2: The triangle block partitions of the lower triangle defined by the affine and projective constructions for $n_1=9$, $c=3$ and $n_1=13$, $c=3$, respectively. The affine and projective constructions have 12 and 13 triangle blocks, respectively. Each entry of the lower triangle is marked with the triangle block to which it belongs. For example, the entries $(5,0)$, $(7,0)$ and $(7,5)$ belong to the $2$nd triangle block in the affine construction. Diagonal elements are assigned in a way that is compatible with the triangle blocks.
Figure 3: Triangle block partition using the affine construction for SYMM ($\mathbf{C}\mathrel{+}=\mathbf{A}\mathbf{B}$) with $c=4$. Segments $0\leq k < c^2+c$ are shown in blue to indicate ownership of an element and element indices $0\leq i < c^2$ are shown in red. Each row of $\mathbf{B}$ and each row of $\mathbf{C}$ are required for all the $c+1$ segments listed in the row.
Figure 4: Triangle block distribution using the affine construction for SYMM of $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ with $c=3$, $P=12$. Processor ranks $0\leq k < P$ are shown in blue to indicate ownership of a block and block indices $0\leq i < c^2$ are shown in red. Distribution of each row block of $\mathbf{B}$ and each row block of $\mathbf{C}$ among their $c+1$ processors is arbitrary as long as it is even.
Figure 5: Triangle block distribution for $c=4$, $c^2+c+1=21$ segments using the projective construction. Diagonal elements are assigned using a greedy procedure.
...and 1 more figures
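The affine case in the Figure 2 caption ($n_1=9$, $c=3$, 12 triangle blocks) can be reproduced with a minimal sketch, assuming each triangle block is generated by one line of the affine plane over $\mathbb{Z}_c$. Several things here are assumptions and may differ from the paper's construction: the point-to-index mapping $(x,y) \mapsto cx+y$, the order in which blocks are listed, and the restriction to prime $c$ (a prime power such as $c=4$ would need finite-field arithmetic that this sketch does not implement). Diagonal entries are not assigned.

```python
from itertools import combinations


def affine_lines(c):
    """Lines of the affine plane over Z_c for prime c, as sorted tuples of indices.

    Point (x, y) is given the index c*x + y (an indexing convention assumed here).
    """
    lines = []
    for m in range(c):        # non-vertical lines y = m*x + b
        for b in range(c):
            lines.append(tuple(sorted(c * x + (m * x + b) % c for x in range(c))))
    for a in range(c):        # vertical lines x = a
        lines.append(tuple(c * a + y for y in range(c)))
    return lines


def triangle_block(R):
    """Strictly lower-triangular entries (i, j), i > j, with both indices in R."""
    return {(i, j) for j, i in combinations(sorted(R), 2)}


c = 3
index_sets = affine_lines(c)
blocks = [triangle_block(R) for R in index_sets]

# Every strictly lower-triangular entry of the 9 x 9 matrix.
lower = {(i, j) for i in range(c * c) for j in range(i)}

print(len(blocks), "blocks;", sum(len(b) for b in blocks), "entries in total")
print("covers the strict lower triangle:", set().union(*blocks) == lower)
```

Because any two points of an affine plane lie on exactly one line, each off-diagonal entry $(i,j)$ lands in exactly one block, and each block of $c$ indices contributes $c(c-1)/2$ entries.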

Theorems & Definitions (33)

Lemma 1: Loomis-Whitney LW49
Lemma 2: ABGKR23
Lemma 3
Lemma 4
Theorem 1
Theorem 2
Corollary 3
Corollary 4
Corollary 5
Corollary 6
...and 23 more
