Parallel Sparse and Data-Sparse Factorization-based Linear Solvers

Xiaoye Sherry Li; Yang Liu

TL;DR

The paper surveys parallel sparse direct solvers on modern HPC architectures, focusing on reducing data movement via communication-avoiding strategies and lowering arithmetic/memory costs through data-sparse, rank-structured (notably $\mathcal{H}$, $\mathcal{H}^2$, HSS, HODLR) representations. It covers algorithmic frameworks (3D CA, DAG/tree-based scheduling), GPU-accelerated implementations, and hybrid structure/data-sparse solvers that combine frontal matrices with compressed blocks. Practical aspects include preprocessing, construction, factorization, and solve phases, along with distributed-memory layouts and batching to harness fine-grained parallelism on CPUs/GPUs. The article also catalogs software packages and delineates open problems in GPU-resident solvers, symmetric indefinite solves, and theoretical analyses of data-sparse methods. Overall, it presents a comprehensive view of advancing scalable, robust direct solvers for large-scale, ill-conditioned systems arising in PDEs, integral equations, and kernel-based computations.

Abstract

Efficient solutions of large-scale, ill-conditioned, and indefinite algebraic equations are needed across many computational fields, including multiphysics simulations, machine learning, and data science. Because of their robustness and accuracy, direct solvers are crucial components in building a scalable solver toolchain. In this article, we review recent advances in sparse direct solvers along two axes: 1) reducing communication and latency costs in both task- and data-parallel settings, and 2) reducing computational complexity via low-rank and other compression techniques such as hierarchical matrix algebra. In addition to algorithmic principles, we also illustrate the key parallelization challenges and best practices to deliver high speed and reliability on modern heterogeneous parallel machines.
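To make the second axis concrete, the following is a minimal sketch of low-rank compression, not the method of any particular package surveyed in the paper. It assumes a hypothetical kernel $1/|x - y|$ between two well-separated 1D point clusters and compresses the resulting dense block with a greedy, fully pivoted cross approximation (in effect, Gaussian elimination with complete pivoting stopped early). The names `kernel_block`, `cross_approximation`, and `reconstruct` are illustrative only.

```python
def kernel_block(n=60):
    """Dense block coupling two well-separated point clusters (assumed kernel 1/|x - y|)."""
    xs = [i / (n - 1) for i in range(n)]          # points in [0, 1]
    ys = [2.0 + j / (n - 1) for j in range(n)]    # points in [2, 3]
    return [[1.0 / abs(x - y) for y in ys] for x in xs]


def cross_approximation(block, tol=1e-8):
    """Return (cols, rows) such that block ~= sum_k outer(cols[k], rows[k])."""
    resid = [row[:] for row in block]             # working copy of the residual
    m, n = len(resid), len(resid[0])
    cols, rows = [], []
    scale = None
    while len(cols) < min(m, n):
        # Complete pivoting: pick the largest remaining residual entry.
        i, j = max(((r, c) for r in range(m) for c in range(n)),
                   key=lambda rc: abs(resid[rc[0]][rc[1]]))
        pivot = resid[i][j]
        if scale is None:
            scale = abs(pivot)
        if abs(pivot) <= tol * scale:             # residual is small enough: stop
            break
        col = [resid[r][j] / pivot for r in range(m)]
        row = resid[i][:]
        for r in range(m):                        # rank-1 Schur-complement-style update
            for c in range(n):
                resid[r][c] -= col[r] * row[c]
        cols.append(col)
        rows.append(row)
    return cols, rows


def reconstruct(cols, rows):
    """Multiply the low-rank factors back into a dense block."""
    m, n = len(cols[0]), len(rows[0])
    return [[sum(col[r] * row[c] for col, row in zip(cols, rows))
             for c in range(n)] for r in range(m)]


block = kernel_block()
cols, rows = cross_approximation(block)
rank = len(cols)
approx = reconstruct(cols, rows)
max_err = max(abs(block[r][c] - approx[r][c])
              for r in range(len(block)) for c in range(len(block[0])))
print("rank:", rank, "max abs error:", max_err)
```

Storing the factors takes `rank * (m + n)` numbers instead of `m * n`, which is where the memory and arithmetic savings of data-sparse formats come from when the rank stays small. Hierarchical formats such as HSS or HODLR apply this kind of compression recursively to off-diagonal blocks; that recursion is not shown here.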

Paper Structure (33 sections, 12 equations, 15 figures, 6 tables, 1 algorithm)

Introduction
Problem statement
Challenges and motivations
Sources of parallelism
Parallel architectures
Parallel algorithms in matrix computation
Analysis of parallel algorithms
Mitigating Communication and Synchronization Costs
Synchronization-avoiding via one-sided communication
Communication-avoiding 3D algorithm framework
Communication-avoiding 3D SpTRSV
Communication lower bounds
Uncovering Fine-grain Parallelism for GPU Acceleration
Batching
Acceleration via Data-sparse Compression
...and 18 more sections

Figures (15)

Figure 1: Sketch of the right-looking and multifrontal algorithms. We define $R \subseteq \{K : N\}$ consisting of indices corresponding to nonzeros of LU, and $R_1=R \setminus \{K\}$. In Listing (2), $T_i$ represents node $i$'s update matrix corresponding to the partial Schur complement update. The two index sets $R$ and $R_1$ change after each step of GE, reflecting the Schur complement sparsity after each elimination step.
Figure 2: Illustration of level set in a lower triangular SpTRSV, $L x = b$.
Figure 3: The view of the logical 3D process grid and an example of 18 processes arranged as a $3 \times 3 \times 2$ process grid.
Figure 4: Two-level etree partition and the matrix view of the submatrix mapping to four 2D process grids.
Figure 5: Asymptotic per process communication volumes given in Tables \ref{tab:asympt2d} and \ref{tab:asympt3d}, with different $P_z$ settings.
...and 10 more figures
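
The level-set idea named in the Figure 2 caption can be sketched as follows. This is a minimal serial illustration under assumed conventions, not the paper's parallel implementation: the lower triangular matrix $L$ is stored as a list of per-row dictionaries (a hypothetical layout chosen for brevity), and the function names `level_sets` and `sptrsv` are illustrative. Unknowns in the same level set have no dependencies on each other, so a parallel solver could process each set concurrently.

```python
def level_sets(rows):
    """Group unknowns of L x = b by dependency depth.

    rows[i] maps column index j -> L[i][j] for j <= i (diagonal included).
    """
    level = [0] * len(rows)
    for i, row in enumerate(rows):
        deps = [level[j] for j in row if j < i]
        # An unknown sits one level below the deepest unknown it depends on.
        level[i] = 1 + max(deps) if deps else 0
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]


def sptrsv(rows, b):
    """Solve L x = b by sweeping the level sets in order."""
    x = [0.0] * len(rows)
    for group in level_sets(rows):
        # Entries within one group are mutually independent; this inner
        # loop is the part that could run in parallel.
        for i in group:
            s = sum(v * x[j] for j, v in rows[i].items() if j < i)
            x[i] = (b[i] - s) / rows[i][i]
    return x


# Small 5x5 example with two independent chains that merge in the last row.
L_rows = [
    {0: 2.0},
    {1: 1.0},
    {0: 1.0, 2: 4.0},
    {1: 3.0, 3: 1.0},
    {2: 1.0, 3: 1.0, 4: 2.0},
]
b = [2.0, 1.0, 5.0, 4.0, 4.0]
sets = level_sets(L_rows)
x = sptrsv(L_rows, b)
print(sets)
print(x)
```

In this example the five unknowns fall into three level sets, so a parallel sweep would need three synchronization steps rather than five sequential ones.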
