Serinv: A Scalable Library for the Selected Inversion of Block-Tridiagonal with Arrowhead Matrices

Vincent Maillou, Lisa Gaedke-Merzhaeuser, Alexandros Nikolaos Ziogas, Olaf Schenk, Mathieu Luisier

TL;DR

This work tackles the challenge of extracting only selected entries of the inverse for large, structured sparse matrices arising in climate modeling and materials science. It introduces Serinv, a distributed, GPU-accelerated library implementing block-Cholesky-based selected inversion for positive-definite block-tridiagonal arrowhead (BTA) matrices, coupled with a parallel three-phase workflow that builds a reduced system to enable scalable inverse computation. The authors provide a theoretical analysis of complexity, load balancing, and parallel efficiency, and validate the approach through experiments on synthetic and INLA-derived datasets, achieving substantial speedups over PARDISO and MUMPS, and strong/weak scaling up to 16 GPUs (and beyond in some configurations). The results demonstrate practical impact for large-scale Bayesian inference and statistical modeling in earth sciences and nano-scale materials, offering a path to handling larger problems than prior CPU-centric or GPU-limited methods. Key contributions include: (i) a distributed, GPU-accelerated selected-inversion algorithm for BTA matrices built on block-Cholesky factorization; (ii) a novel partitioning and permutation scheme to expose parallelism without physically permuting data; (iii) a reduced-system approach that enables efficient parallel selected inversion across partitions; and (iv) theoretical and empirical comparisons showing competitive or superior performance and scalability relative to state-of-the-art solvers.

Abstract

The inversion of structured sparse matrices is a key but computationally and memory-intensive operation in many scientific applications. There are cases, however, where only particular entries of the full inverse are required. This has motivated the development of so-called selected-inversion algorithms, capable of computing only specific elements of the full inverse. Currently, most of them are either shared-memory codes or limited to CPU implementations. Here, we introduce Serinv, a scalable library providing distributed, GPU-based algorithms for the selected inversion and Cholesky decomposition of positive-definite, block-tridiagonal arrowhead matrices. This matrix class is highly relevant in statistical climate modeling and materials science applications. The performance of Serinv is demonstrated on synthetic and real datasets from statistical air temperature prediction models. In our numerical tests, Serinv achieves 32.3% strong and 47.2% weak scaling efficiency and up to two orders of magnitude speedup over the sparse direct solvers PARDISO and MUMPS on 16 GPUs.
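To make the notion of selected inversion concrete, here is a minimal NumPy sketch, not Serinv's API. It builds a small symmetric positive-definite BTA matrix with `n` main diagonal blocks of size `b` and an arrow tip block of size `a`, then keeps only the entries of the inverse that lie inside the BTA pattern. The function names `bta_mask`, `random_spd_bta`, and `selected_inverse_reference` are hypothetical. The dense `np.linalg.inv` call is a brute-force reference for small cases only: it forms the full inverse, which is exactly the cost a selected-inversion algorithm is meant to avoid.

```python
import numpy as np


def bta_mask(n, b, a):
    """Boolean mask of the block-tridiagonal arrowhead (BTA) sparsity pattern."""
    size = n * b + a
    mask = np.zeros((size, size), dtype=bool)
    for i in range(n):
        lo, hi = i * b, (i + 1) * b
        mask[lo:hi, lo:hi] = True              # main diagonal block
        if i + 1 < n:
            mask[hi:hi + b, lo:hi] = True      # lower off-diagonal block
            mask[lo:hi, hi:hi + b] = True      # upper off-diagonal block
    mask[n * b:, :] = True                     # arrow rows (bottom), incl. tip
    mask[:, n * b:] = True                     # arrow columns (right), incl. tip
    return mask


def random_spd_bta(n, b, a, seed=0):
    """Random symmetric positive-definite matrix with a BTA pattern (illustrative)."""
    rng = np.random.default_rng(seed)
    size = n * b + a
    m = rng.standard_normal((size, size))
    mat = np.where(bta_mask(n, b, a), m + m.T, 0.0)  # symmetric, BTA-patterned
    # A symmetric, strictly diagonally dominant matrix with a positive
    # diagonal is positive definite.
    np.fill_diagonal(mat, 0.0)
    mat += np.diag(np.abs(mat).sum(axis=1) + 1.0)
    return mat


def selected_inverse_reference(mat, n, b, a):
    """Dense reference: full inverse, then keep only entries in the BTA pattern."""
    full_inverse = np.linalg.inv(mat)
    return np.where(bta_mask(n, b, a), full_inverse, 0.0)


A = random_spd_bta(n=4, b=3, a=2)
X_sel = selected_inverse_reference(A, n=4, b=3, a=2)
```

The full inverse of a BTA matrix is generally dense, so `X_sel` discards most of it. A reference like this is mainly useful for checking the output of a structured solver on small problems.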

Paper Structure

This paper contains 26 sections, 8 figures, 4 tables, 6 algorithms.

Figures (8)

  • Figure 1: Symmetric, positive-definite, block-tridiagonal arrowhead (BTA) matrix resulting from statistical modeling applied to temperature prediction. The data was discretized on a 7-day time grid. The matrix is described by the number of main diagonal blocks $n$, their size $b$, and the arrow tip block size $a$.
  • Figure 2: Permutation scheme applied to a symmetric, positive-definite matrix in order to perform its parallel factorization and selected inversion. The matrix is distributed amongst three processes and permuted accordingly. The permutation-induced fill-in during the decomposition is shown in red hatches in the permuted matrix.
  • Figure 3: General organization of the distributed block-Cholesky factorization and selected inversion of a positive-definite BTA matrix. The method consists of three steps. a) and b) Parallel block-Cholesky factorization. c) and d) Creation of the reduced system $A_r$ and its selected inversion $X_r$. e) and f) Parallel selected inversion.
  • Figure 4: a) Theoretical maximum parallel efficiency of the complete selected-inversion procedure (PPOBTAF + POBTARSSI + PPOBTASI) as a function of the number of main diagonal blocks (horizontal axis) and processes (vertical axis) for BTA matrices with $b=1024$ and $a=256$. b) Experimental parallel efficiency of the complete selected-inversion procedure (PPOBTAF + POBTARSSI + PPOBTASI) using the theoretically determined ideal load balancing factor $r_{LB}$.
  • Figure 5: Theoretical FLOP count distributions for the GEMM, TRSM, and POTRF routines within the a) POBTAF and b) POBTASI algorithms. c) and d) report the corresponding breakdown based on actual runtime measurements for a BTA matrix with $b=1024$ and $a=256$ on an NVIDIA GH200 (GPU). We show the kernel performances in TFLOPS in parentheses.
  • ...and 3 more figures