Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations
Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse, Mathieu Verite
TL;DR
The paper develops tight, geometry-based lower bounds on data movement for the symmetric BLAS kernels SYRK, SYR2K, and SYMM in both sequential and distributed-memory models. It extends the symmetric Loomis–Whitney inequality to relate the 3D iteration space of symmetric three-nested-loop (3NL) computations to their 2D data accesses, then solves constrained optimization problems to obtain memory-dependent and memory-independent bounds. To match these bounds, it introduces triangle-block partitions of the lower triangle of the symmetric matrix, which connect to balanced clique partitions and Steiner systems through affine and projective geometric constructions. It then presents 1D, 2D, and 3D parallel algorithms, along with limited-memory variants, that are communication-optimal, and it analyzes their memory, bandwidth, and computational costs, showing that the leading constants match those of the lower bounds. The results generalize prior symmetric lower-bound approaches and offer a principled design framework for symmetry-exploiting matrix computations, with practical implications for high-performance BLAS implementations.
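The link between triangle-block partitions and Steiner systems mentioned above can be illustrated on a small case. The sketch below is not taken from the paper. It assumes that a triangle block for an index set R consists of all strictly lower triangular entries (i, j) with both i and j in R, and it uses the seven lines of the projective plane of order 2 (the Fano plane, a Steiner system S(2, 3, 7)) as index sets. The function names are hypothetical, and diagonal entries are left out and would need a separate assignment.

```python
from itertools import combinations


def triangle_block(R):
    """Entries (i, j) with i > j whose row and column indices both lie in R.

    Assumed definition of a triangle block, used for illustration only.
    """
    return {(i, j) for j, i in combinations(sorted(R), 2)}


def fano_index_sets():
    """Lines of the projective plane of order 2 on the points 0..6.

    Built as translates of the difference set {0, 1, 3} modulo 7, so every
    pair of points lies on exactly one line.
    """
    return [{(s + d) % 7 for d in (0, 1, 3)} for s in range(7)]


def partitions_strict_lower_triangle(index_sets, n):
    """Check that the triangle blocks tile the strict lower triangle of an n x n matrix."""
    blocks = [triangle_block(R) for R in index_sets]
    covered = set().union(*blocks)
    target = {(i, j) for i in range(n) for j in range(i)}
    # Equal coverage plus equal total size means there is no overlap between blocks.
    return covered == target and sum(len(b) for b in blocks) == len(target)
```

Calling `partitions_strict_lower_triangle(fano_index_sets(), 7)` checks that the seven blocks, each touching only three rows and columns, cover every off-diagonal entry of a 7 x 7 symmetric matrix exactly once. Every block has the same size, which is the balance property a clique partition needs for even load distribution.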
Abstract
In this article, we focus on the communication costs of three symmetric matrix computations: i) multiplying a matrix by its transpose, known as a symmetric rank-k update (SYRK); ii) adding the product of a matrix with the transpose of another matrix to the transpose of that product, known as a symmetric rank-2k update (SYR2K); and iii) performing matrix multiplication with a symmetric input matrix (SYMM). All three computations appear in the Level 3 Basic Linear Algebra Subprograms (BLAS) and are widely used in applications involving symmetric matrices. We establish communication lower bounds for these kernels using sequential and distributed-memory parallel computational models, and we show that our bounds are tight by presenting communication-optimal algorithms for each setting. Our lower bound proofs rely on applying a geometric inequality for symmetric computations and analytically solving constrained nonlinear optimization problems. In the optimal algorithms, the symmetric matrix is accessed, and the corresponding computations are performed, according to a triangular block partitioning scheme.
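As a reference point for the three kernels described above, the following is a minimal pure-Python sketch of what each one computes, written as three nested loops that touch only the lower triangle of the symmetric matrix. It is a definitional illustration, not one of the paper's communication-optimal algorithms. The function names and the list-of-lists matrix representation are assumptions made for this example, and the scaling factors and accumulation into an existing C that the BLAS interfaces allow are omitted.

```python
def syrk_lower(A):
    """Lower triangle of C = A * A^T for an n x k matrix A."""
    n, k = len(A), len(A[0])
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # symmetry: only entries with j <= i are computed
            for l in range(k):
                C[i][j] += A[i][l] * A[j][l]
    return C


def syr2k_lower(A, B):
    """Lower triangle of C = A * B^T + B * A^T for n x k matrices A and B."""
    n, k = len(A), len(A[0])
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            for l in range(k):
                C[i][j] += A[i][l] * B[j][l] + B[i][l] * A[j][l]
    return C


def symm_from_lower(S_lower, B):
    """C = S * B, where only the lower triangle of the symmetric S is read."""
    n, m = len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            for l in range(m):
                C[i][l] += S_lower[i][j] * B[j][l]
                if i != j:
                    # The same stored entry also stands in for S[j][i].
                    C[j][l] += S_lower[i][j] * B[i][l]
    return C
```

In each function the loops over i and j run only over pairs with j <= i, so the iteration space is roughly half that of a general matrix multiplication of the same dimensions. In SYRK and SYR2K the symmetric matrix is the output, while in SYMM it is an input, which is why each stored entry of `S_lower` contributes to two rows of the result.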
