Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations

Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse, Mathieu Verite

TL;DR

The paper develops tight, geometry-based lower bounds on data movement for the symmetric BLAS kernels SYRK, SYR2K, and SYMM in both sequential and distributed-memory models. It extends the symmetric Loomis–Whitney inequality to relate the 3D iteration space of symmetric three-nested-loop (3NL) computations to their 2D data accesses, then solves constrained optimization problems to obtain memory-dependent and memory-independent bounds. To match these bounds, it introduces triangle-block partitions of the lower triangle of the symmetric matrix, which connect to balanced clique partitions and Steiner systems through affine and projective geometric constructions. It then presents communication-optimal 1D, 2D, and 3D parallel algorithms (and limited-memory variants) and analyzes their memory, bandwidth, and computational costs in detail, showing that they match the lower bounds in the leading constant. The results generalize prior lower-bound approaches for symmetric computations and offer a principled design framework for symmetry-exploiting matrix computations, with practical implications for high-performance BLAS implementations.

Abstract

In this article, we focus on the communication costs of three symmetric matrix computations: i) multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK); ii) adding the result of the multiplication of a matrix with the transpose of another matrix and the transpose of that result, known as a symmetric rank-2k update (SYR2K); and iii) performing matrix multiplication with a symmetric input matrix (SYMM). All three computations appear in the Level 3 Basic Linear Algebra Subroutines (BLAS) and have wide use in applications involving symmetric matrices. We establish communication lower bounds for these kernels using sequential and distributed-memory parallel computational models, and we show that our bounds are tight by presenting communication-optimal algorithms for each setting. Our lower bound proofs rely on applying a geometric inequality for symmetric computations and analytically solving constrained nonlinear optimization problems. In the optimal algorithms, the symmetric matrix is accessed and its corresponding computations are performed according to a triangular block partitioning scheme.
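
To make the three kernels concrete, the following minimal sketch writes each one as a plain three-nested loop in Python. The function names (`syrk_lower`, `syr2k_lower`, `symm`), the list-of-lists storage, and the small test matrices are illustrative assumptions, not taken from the paper, and the loops make no attempt at communication efficiency. They only show how symmetry lets each kernel compute or read just the lower triangle of the symmetric matrix.

```python
def syrk_lower(A):
    """Lower triangle of C = A A^T for an n1 x n2 matrix A (SYRK)."""
    n1, n2 = len(A), len(A[0])
    C = [[0] * n1 for _ in range(n1)]
    for i in range(n1):
        for j in range(i + 1):            # j <= i: lower triangle only
            for k in range(n2):
                C[i][j] += A[i][k] * A[j][k]
    return C


def syr2k_lower(A, B):
    """Lower triangle of C = A B^T + B A^T for n1 x n2 matrices A, B (SYR2K)."""
    n1, n2 = len(A), len(A[0])
    C = [[0] * n1 for _ in range(n1)]
    for i in range(n1):
        for j in range(i + 1):
            for k in range(n2):
                C[i][j] += A[i][k] * B[j][k] + B[i][k] * A[j][k]
    return C


def symm(A_lower, B):
    """C = A B for symmetric A, reading only the lower triangle of A (SYMM)."""
    n1, n2 = len(B), len(B[0])
    C = [[0] * n2 for _ in range(n1)]
    for i in range(n1):
        for j in range(i + 1):
            a = A_lower[i][j]
            for k in range(n2):
                C[i][k] += a * B[j][k]
                if i != j:
                    C[j][k] += a * B[i][k]  # uses A[j][i] == A[i][j]
    return C


# Hypothetical 3 x 2 inputs, chosen only for illustration.
A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0], [0, 1], [2, 2]]
S = syrk_lower(A)        # lower triangle of the symmetric matrix A A^T
T = syr2k_lower(A, B)    # lower triangle of A B^T + B A^T
C = symm(S, B)           # (A A^T) B, reading only the stored lower triangle
```

In each kernel, one of the three matrices is symmetric, so the loop over `(i, j)` can be restricted to `j <= i`. This restricted iteration space is the object the lower-bound arguments reason about.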

Paper Structure (62 sections, 25 theorems, 62 equations, 6 figures, 4 tables, 18 algorithms)


Key Result

Lemma 1

Let $V$ be a finite set of points in $\mathbb{Z}^3$. Let $\phi_i(V)$ be the projection of $V$ in the $i$-direction, i.e. all points $(j,k)$ such that there exists an $i$ so that $(i,j,k) \in V$. Define $\phi_j(V)$ and $\phi_k(V)$ similarly. Then $|V| \leq |\phi_i(V)|^{1/2} \, |\phi_j(V)|^{1/2} \, |\phi_k(V)|^{1/2}$, where $|\cdot|$ denotes the cardinality of a set.
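
The Loomis–Whitney bound, in its squared form $|V|^2 \leq |\phi_i(V)| \, |\phi_j(V)| \, |\phi_k(V)|$, can be checked numerically on small point sets. The sketch below is only an illustration: the helper names `projections` and `loomis_whitney_holds`, the cube example, and the random sampling parameters are assumptions made for this example. The squared form is used so that the comparison stays in exact integer arithmetic.

```python
import random


def projections(V):
    """Return the three coordinate projections of a set V of (i, j, k) points."""
    phi_i = {(j, k) for (i, j, k) in V}   # drop the i coordinate
    phi_j = {(i, k) for (i, j, k) in V}   # drop the j coordinate
    phi_k = {(i, j) for (i, j, k) in V}   # drop the k coordinate
    return phi_i, phi_j, phi_k


def loomis_whitney_holds(V):
    """Check |V|^2 <= |phi_i(V)| * |phi_j(V)| * |phi_k(V)| exactly."""
    phi_i, phi_j, phi_k = projections(V)
    return len(V) ** 2 <= len(phi_i) * len(phi_j) * len(phi_k)


# A 3 x 3 x 3 cube attains equality: 27^2 = 9 * 9 * 9.
cube = {(i, j, k) for i in range(3) for j in range(3) for k in range(3)}
cube_sizes = [len(p) for p in projections(cube)]

# Random subsets of a 5 x 5 x 5 grid (seeded for reproducibility).
rng = random.Random(0)
samples = [
    {tuple(rng.randrange(5) for _ in range(3)) for _ in range(40)}
    for _ in range(100)
]
all_ok = all(loomis_whitney_holds(V) for V in samples)
```

In the communication setting, $V$ plays the role of a set of loop iterations and the projections correspond to the matrix entries those iterations touch, so the inequality limits how much work a given amount of data can support.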

Figures (6)

  • Figure 1: Triangle block partition for $n_1=16$ and $|R_k|=4$. Triangle blocks for $R_{3}$ and $R_{17}$ are highlighted to illustrate both non-contiguous and contiguous triangle blocks.
  • Figure 2: The triangle block partitions of the lower triangle defined by the affine and projective constructions for $n_1=9$, $c=3$ and $n_1=13$, $c=3$, respectively. The affine and projective constructions have 12 and 13 triangle blocks, respectively. Each entry of the lower triangle is marked with the triangle block to which it belongs. For example, the $(5,0)$, $(7,0)$ and $(7,5)$ entries belong to the $2$nd triangle block in the affine construction. Diagonal elements are assigned in a way that is compatible with the triangle blocks.
  • Figure 3: Triangle block partition using the affine construction for SYMM ($\mathbf{C}\mathrel{+}=\mathbf{A}\mathbf{B}$) with $c=4$. Segments $0\leq k < c^2+c$ are shown in blue to indicate ownership of an element and element indices $0\leq i < c^2$ are shown in red. Each row of $\mathbf{B}$ and each row of $\mathbf{C}$ are required for all the $c+1$ segments listed in the row.
  • Figure 4: Triangle block distribution using the affine construction for SYMM of $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ with $c=3$, $P=12$. Processor ranks $0\leq k < P$ are shown in blue to indicate ownership of a block and block indices $0\leq i < c^2$ are shown in red. Distribution of each row block of $\mathbf{B}$ and each row block of $\mathbf{C}$ among their $c+1$ processors is arbitrary as long as it is even.
  • Figure 5: Triangle block distribution for $c=4$, $c^2+c+1=21$ segments using the projective construction. Diagonal elements are assigned using a greedy procedure.
  • ...and 1 more figure
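
The affine triangle block partition described in the Figure 2 caption ($n_1=9$, $c=3$, 12 triangle blocks) can be reproduced with a short sketch. Several assumptions are made here and are not taken from the paper: $c$ is prime, so that arithmetic modulo $c$ forms a field; point $(x,y)$ of the affine plane is mapped to matrix index $c\,x+y$; and the order in which blocks are generated is arbitrary, so block numbering may differ from the figure. Only the strict lower triangle is covered, and the diagonal assignment mentioned in the captions is left out.

```python
from itertools import combinations


def affine_index_sets(c):
    """Index sets R_k: the c^2 + c lines of the affine plane over Z_c (c prime)."""
    lines = []
    for m in range(c):                    # non-vertical lines y = m*x + b
        for b in range(c):
            lines.append(frozenset(c * x + (m * x + b) % c for x in range(c)))
    for a in range(c):                    # vertical lines x = a
        lines.append(frozenset(c * a + y for y in range(c)))
    return lines


def triangle_block(R):
    """Strict-lower-triangle entries (i, j), i > j, with both indices in R."""
    return {(i, j) for j, i in combinations(sorted(R), 2)}


c = 3
n1 = c * c
index_sets = affine_index_sets(c)
blocks = [triangle_block(R) for R in index_sets]

# Every strict-lower-triangle entry should be covered exactly once.
lower = {(i, j) for i in range(n1) for j in range(i)}
covered = [entry for blk in blocks for entry in blk]
is_partition = len(covered) == len(lower) and set(covered) == lower
```

Each index set has $c$ elements, so a processor holding the $c$ corresponding rows can compute all $c(c-1)/2$ off-diagonal entries of its triangle block. Under the indexing assumed here, the set $\{0,5,7\}$ appears among the lines, which agrees with the example entries $(5,0)$, $(7,0)$ and $(7,5)$ in the caption. For non-prime $c$ such as $c=4$, a finite field of order $c$ would be needed instead of arithmetic modulo $c$.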

Theorems & Definitions (33)

  • Lemma 1: Loomis-Whitney LW49
  • Lemma 2: ABGKR23
  • Lemma 3
  • Lemma 4
  • Theorem 1
  • Theorem 2
  • Corollary 3
  • Corollary 4
  • Corollary 5
  • Corollary 6
  • ...and 23 more