Parallelization and scalability analysis of inverse factorization using the Chunks and Tasks programming model

Anton G. Artemov, Elias Rudberg, Emanuel H. Rubensson

TL;DR

The paper develops and analyzes three distributed-memory inverse factorization methods for block-sparse Hermitian positive definite matrices: RINCH (a recursive variant of the AINV inverse Cholesky algorithm), IRSI (iterative refinement), and LIF (localized inverse factorization). The main target is the basis set overlap matrix in electronic structure calculations. All three methods are implemented in the Chunks and Tasks programming model on top of a distributed sparse quad-tree matrix representation. The authors give theoretical critical-path analyses and run experiments on Gaussian-basis systems with more than 10 million basis functions, where the computational cost of all three methods grows linearly with system size. In weak-scaling tests, IRSI and LIF outperform previous parallel implementations, and LIF requires much less communication than pure iterative refinement.

Abstract

We present three methods for distributed-memory parallel inverse factorization of block-sparse Hermitian positive definite matrices. The three methods are a recursive variant of the AINV inverse Cholesky algorithm, iterative refinement, and localized inverse factorization, respectively. All three methods are implemented using the Chunks and Tasks programming model, building on the distributed sparse quad-tree matrix representation and parallel matrix-matrix multiplication in the publicly available Chunks and Tasks Matrix Library (CHTML). Although the algorithms are generally applicable, this work was mainly motivated by the need for efficient and scalable inverse factorization of the basis set overlap matrix in large-scale electronic structure calculations. We perform various computational tests on overlap matrices for quasi-linear Glutamic Acid-Alanine molecules and three-dimensional water clusters discretized using the standard Gaussian basis set STO-3G with up to more than 10 million basis functions. We show that for such matrices the computational cost increases only linearly with system size for all three methods. We show both theoretically and in numerical experiments that the methods based on iterative refinement and localized inverse factorization outperform previous parallel implementations in weak scaling tests where the system size is increased in direct proportion to the number of processes. We also show that, compared to the method based on pure iterative refinement, localized inverse factorization requires much less communication.
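To make the notion of an inverse factor concrete, the following is a minimal dense NumPy sketch of iterative refinement: it looks for a matrix $Z$ with $Z^T S Z = I$ by repeatedly correcting $Z$ with the residual $\delta = I - Z^T S Z$. The first-order update $Z \leftarrow Z(I + \delta/2)$, the scaled-identity starting guess, the function name `refine_inverse_factor`, and the random test matrix are all assumptions made for illustration. The sketch uses dense real symmetric matrices on a single process and is not the paper's Chunks and Tasks implementation.

```python
import numpy as np

def refine_inverse_factor(S, tol=1e-10, max_iter=100):
    """Illustrative dense iterative refinement of an inverse factor Z with Z^T S Z = I.

    Assumes S is real symmetric positive definite. Hypothetical helper,
    not part of any library.
    """
    n = S.shape[0]
    I = np.eye(n)
    # Scaled identity start: the induced 1-norm bounds the largest eigenvalue
    # of S, so the initial residual I - Z^T S Z has spectral norm below 1.
    Z = I / np.sqrt(np.linalg.norm(S, 1))
    for _ in range(max_iter):
        delta = I - Z.T @ S @ Z          # current factorization residual
        if np.linalg.norm(delta, "fro") < tol:
            break
        Z = Z @ (I + 0.5 * delta)        # first-order refinement step
    return Z

# Small synthetic test matrix (an assumption, not data from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
S = A @ A.T + 6.0 * np.eye(6)

Z = refine_inverse_factor(S)
residual = np.linalg.norm(np.eye(6) - Z.T @ S @ Z)
```

Each iteration costs only matrix-matrix multiplications, which is why an approach of this kind can build directly on a parallel sparse multiplication routine.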

Paper Structure

This paper contains 19 sections, 24 equations, 7 figures, 1 table, 5 algorithms.

Figures (7)

  • Figure 1: Left panel: Scaling with system size of the RINCH, IRSI and LIF algorithms for Glu-Ala helices of increasing length. The tests were run with 3 and 48 processes. Right panel: numbers of non-zero elements per row in the corresponding inverse factors and in the original overlap matrix $S$.
  • Figure 2: Left panel: Scaling with system size of the RINCH, IRSI and LIF algorithms for water clusters of increasing size. The tests were run with 12 and 192 processes. Right panel: numbers of non-zero elements per row in the corresponding inverse factors and in the original overlap matrix $S$.
  • Figure 3: Left panel: Strong scaling of the RINCH, IRSI and LIF algorithms for a Glu-Ala helix containing 1703938 atoms, which gave a system size of 5373954 basis functions. Right panel: strong scaling of the RINCH, IRSI and LIF algorithms for a water cluster containing 432498 atoms, which gave a system size of 1009162 basis functions. For both cases, the number of processes was doubled each time while the system size was kept the same.
  • Figure 4: Left panel: Approximate weak scaling of the RINCH, IRSI, and LIF algorithms for Glu-Ala helices of increasing length. The number of basis functions per process was approximately fixed to $112 \times 10^3$, so that the system size is scaled up together with the number of processes. Right panel: Critical path length as reported by the CHT-MPI library, defined as the largest number of tasks that have to be executed serially. The dashed and dashed-dotted guide lines show $c_0 + c_1 \log(N) + c_2 \log^2(N) + c_3 \log^3(N)$ least squares fits for IRSI and LIF, respectively. The data in the left panel corresponds to the 5 rightmost points in the right plot.
  • Figure 5: Left panel: Approximate weak scaling of the RINCH, IRSI, and LIF algorithms for water clusters of increasing size. The number of basis functions per process was approximately fixed to $84 \times 10^3$, so that the system size is scaled up together with the number of processes. Right panel: Critical path length as reported by the CHT-MPI library, defined as the largest number of tasks that have to be executed serially. The dashed and dashed-dotted guide lines show $c_0 + c_1 \log(N) + c_2 \log^2(N) + c_3 \log^3(N)$ least squares fits for IRSI and LIF, respectively. The data in the left panel corresponds to the 7 rightmost points in the right plot.
  • ...and 2 more figures
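
The captions of Figures 4 and 5 mention least squares fits of the form $c_0 + c_1 \log(N) + c_2 \log^2(N) + c_3 \log^3(N)$ to the measured critical path lengths. A fit of that form can be sketched as an ordinary linear least squares problem in the variable $\log(N)$. The helper name `fit_polylog` and all numbers below are synthetic placeholders chosen for illustration, not measurements from the paper.

```python
import numpy as np

def fit_polylog(N, y, degree=3):
    """Least squares fit of y ~ sum_k c_k * log(N)**k for k = 0..degree.

    Hypothetical helper for illustration only.
    """
    x = np.log(np.asarray(N, dtype=float))
    # Columns are 1, log N, log^2 N, ..., log^degree N.
    V = np.vander(x, degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(V, np.asarray(y, dtype=float), rcond=None)
    return coeffs

# Synthetic system sizes and critical path lengths (assumed values).
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6, 1e7])
true_c = np.array([5.0, 2.0, 0.5, 0.1])
y = np.vander(np.log(N), 4, increasing=True) @ true_c

c = fit_polylog(N, y)
fitted = np.vander(np.log(N), 4, increasing=True) @ c
```

Because the model is linear in the coefficients $c_k$, no nonlinear optimizer is needed; the only nonlinearity is the change of variable from $N$ to $\log(N)$.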