A mixed precision LOBPCG algorithm

Daniel Kressner; Yuxin Ma; Meiyue Shao

TL;DR

This work targets the efficient computation of a few of the smallest eigenpairs of a large Hermitian positive definite matrix $A$ using a mixed-precision LOBPCG framework. The method combines a reduced-precision (sparse) Cholesky preconditioner with mixed-precision orthogonalization and a two-stage workflow to reach high-accuracy solutions at reduced cost. A finite-precision convergence analysis of PINVIT, a simplified variant of LOBPCG, shows that rounding errors in the preconditioner have only a marginal effect on convergence. Experiments on sparse and dense matrices, including complex kernel matrices, show speedups of up to roughly $2\times$ on CPUs and GPUs.

Abstract

The locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm is a popular approach for computing a few smallest eigenvalues and the corresponding eigenvectors of a large Hermitian positive definite matrix $A$. In this work, we propose a mixed precision variant of LOBPCG that uses a (sparse) Cholesky factorization of $A$ computed in reduced precision as the preconditioner. To further enhance performance, a mixed precision orthogonalization strategy is proposed. To analyze the impact of reducing precision in the preconditioner on performance, we carry out a rounding error and convergence analysis of PINVIT, a simplified variant of LOBPCG. Our theoretical results predict and our numerical experiments confirm that the impact on convergence remains marginal. In practice, our mixed precision LOBPCG algorithm typically reduces the computation time by a factor of 1.4 to 2.0 on both CPUs and GPUs.
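The core idea of the abstract, a preconditioner factored in reduced precision inside an iteration that otherwise runs in double precision, can be illustrated with a minimal single-vector PINVIT sketch. This is not the authors' implementation: the function names `laplacian_1d` and `pinvit_low_precision`, the dense test matrix, the stopping rule, and the use of `np.linalg.solve` on the Cholesky factor (a dedicated triangular solver would normally be used) are all assumptions made for illustration.

```python
import numpy as np

def laplacian_1d(n):
    """Dense 1D Laplacian (symmetric positive definite); a hypothetical test matrix."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def pinvit_low_precision(A, x0, tol=1e-12, maxit=500):
    """Single-vector PINVIT for the smallest eigenpair of an SPD matrix A.

    The preconditioner is a Cholesky factor of A computed in float32;
    the Rayleigh quotient, residual, and update are kept in float64.
    """
    norm_A = np.linalg.norm(A, "fro")
    # Factor A once in single precision: A ~= L @ L.T
    L = np.linalg.cholesky(A.astype(np.float32))
    x = x0 / np.linalg.norm(x0)
    for it in range(maxit):
        Ax = A @ x
        rho = x @ Ax                  # Rayleigh quotient (x has unit norm)
        r = Ax - rho * x              # residual in double precision
        if np.linalg.norm(r) <= tol * norm_A:
            break
        # Apply the preconditioner entirely in float32: w ~= A^{-1} r.
        r32 = r.astype(np.float32)
        w = np.linalg.solve(L.T, np.linalg.solve(L, r32))
        # Promote the correction and update in float64.
        x = x - w.astype(np.float64)
        x /= np.linalg.norm(x)
    return rho, x, it
```

Because the float32 factor only enters through the correction `w`, the attainable accuracy is governed by the double-precision residual, while the rounding error in the factor affects only how quickly the iteration contracts.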

Paper Structure (14 sections, 3 theorems, 39 equations, 6 figures, 3 tables, 3 algorithms)

Introduction
LOBPCG algorithm
Mixed precision algorithms
Lower precision preconditioning
A mixed precision orthogonalization procedure
A mixed precision LOBPCG algorithm
Convergence in finite-precision arithmetic
Numerical experiments
Experiment settings
Advantage of mixed precision orthogonalization
Tests for sparse matrices
Tests for dense matrices
Tests on different GPUs
Conclusion

Key Result

Lemma 1

Let $\hat{r}_i$ denote the result of evaluating $r_i:=Ax_i - \rho(x_i)x_i$ in working precision. Assuming that (eq:er-Ax-ge) holds, there exist a symmetric matrix $F\in\mathbb{R}^{n\times n}$ and a diagonal matrix $E\in\mathbb{R}^{n\times n}$ such that … where $\lVert E\rVert_2\leq \bm{u}_h$ and $\lVert F\rVert_2\leq \epsilon_r\lVert A\rVert_2$ with …
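To get a feel for the quantity Lemma 1 is about, the rounding error committed when evaluating the residual $Ax - \rho(x)x$ can be measured numerically. The sketch below does not reproduce the lemma's bound. It evaluates the residual in a lower precision (float32, chosen here only so that float64 can serve as a reference) and reports the error in units of $u\,\lVert A\rVert_2\lVert x\rVert_2$, where $u$ is the unit roundoff of the lower precision. The function names and the choice of precisions are assumptions for illustration.

```python
import numpy as np

def residual(A, x):
    """Eigenvalue residual r = A x - rho(x) x, evaluated in the dtype of A and x."""
    Ax = A @ x
    rho = (x @ Ax) / (x @ x)          # Rayleigh quotient
    return Ax - rho * x

def residual_rounding_error(A, x, low=np.float32):
    """Error of the residual evaluated in `low` precision,
    measured in units of u * ||A||_2 * ||x||_2 (u = unit roundoff of `low`)."""
    A_low, x_low = A.astype(low), x.astype(low)
    r_low = residual(A_low, x_low)
    # Reference: the same (already rounded) inputs, evaluated in float64.
    A_ref, x_ref = A_low.astype(np.float64), x_low.astype(np.float64)
    r_ref = residual(A_ref, x_ref)
    u = np.finfo(low).eps / 2
    scale = u * np.linalg.norm(A_ref, 2) * np.linalg.norm(x_ref)
    return np.linalg.norm(r_low.astype(np.float64) - r_ref) / scale
```

For a well-scaled symmetric positive definite matrix, the returned ratio is typically a modest constant, which matches the general shape of the bounds stated for $E$ and $F$: the error is proportional to the unit roundoff times $\lVert A\rVert_2$.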

Figures (6)

Figure 1: Tests for real sparse matrices on CPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
Figure 2: Tests for real sparse matrices on GPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
Figure 3: Tests for real dense kernel matrices on CPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
Figure 4: Tests for real dense kernel matrices on GPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
Figure 5: Tests for complex dense kernel matrices on GPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
...and 1 more figure

Theorems & Definitions (7)

Lemma 1
proof
Theorem 2
proof
Remark
Lemma 3
proof
