Table of Contents
Fetching ...

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

David Jin, Alexis Montoison, Sungho Shin

TL;DR

This work presents a GPU-based implementation for the factorization and solution of block-tridiagonal symmetric positive definite (SPD) linear systems, employing a recursive Schur-complement reduction to transform the original system into a hierarchy of smaller, independent systems that can be solved in parallel using batched BLAS/LAPACK routines.

Abstract

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of estimators and controllers. We present a GPU-based implementation for the factorization and solution of block-tridiagonal symmetric positive definite (SPD) linear systems. Our method employs a recursive Schur-complement reduction, transforming the original system into a hierarchy of smaller, independent systems that can be solved in parallel using batched BLAS/LAPACK routines. Performance benchmarks with our cross-platform (NVIDIA and AMD) implementation, BlockDSS, show substantial speed-ups over state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. At the same time, the current implementation still invokes batched routines sequentially at each recursion level, and high efficiency requires block sizes large enough to amortize kernel launch overhead.

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

TL;DR

This work presents a GPU-based implementation for the factorization and solution of block-tridiagonal symmetric positive definite (SPD) linear systems, employing a recursive Schur-complement reduction to transform the original system into a hierarchy of smaller, independent systems that can be solved in parallel using batched BLAS/LAPACK routines.

Abstract

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of estimators and controllers. We present a GPU-based implementation for the factorization and solution of block-tridiagonal symmetric positive definite (SPD) linear systems. Our method employs a recursive Schur-complement reduction, transforming the original system into a hierarchy of smaller, independent systems that can be solved in parallel using batched BLAS/LAPACK routines. Performance benchmarks with our cross-platform (NVIDIA and AMD) implementation, BlockDSS, show substantial speed-ups over state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. At the same time, the current implementation still invokes batched routines sequentially at each recursion level, and high efficiency requires block sizes large enough to amortize kernel launch overhead.

Paper Structure