A mixed precision LOBPCG algorithm

Daniel Kressner, Yuxin Ma, Meiyue Shao

TL;DR

This work addresses the efficient computation of a few smallest eigenpairs of a large Hermitian positive definite matrix $A$ by introducing a mixed-precision LOBPCG framework. It combines a reduced-precision (sparse) Cholesky preconditioner with mixed-precision orthogonalization and a two-stage workflow to obtain high-accuracy solutions at reduced cost. The authors provide a finite-precision convergence analysis showing that rounding errors in the preconditioner have only a marginal effect on convergence, and they demonstrate speedups of up to roughly $2\times$ on CPUs and GPUs in sparse and dense settings, including complex dense kernel matrices. The approach accelerates large-scale eigenvalue computations while preserving accuracy.

Abstract

The locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm is a popular approach for computing a few smallest eigenvalues and the corresponding eigenvectors of a large Hermitian positive definite matrix A. In this work, we propose a mixed precision variant of LOBPCG that uses a (sparse) Cholesky factorization of A computed in reduced precision as the preconditioner. To further enhance performance, a mixed precision orthogonalization strategy is proposed. To analyze the impact of reducing precision in the preconditioner on performance, we carry out a rounding error and convergence analysis of PINVIT, a simplified variant of LOBPCG. Our theoretical results predict and our numerical experiments confirm that the impact on convergence remains marginal. In practice, our mixed precision LOBPCG algorithm typically reduces the computation time by a factor of 1.4--2.0 on both CPUs and GPUs.
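
The core idea of the abstract, a reduced-precision Cholesky factor used as the preconditioner inside a double-precision LOBPCG iteration, can be illustrated with a minimal Python sketch. This is not the authors' implementation: it relies on SciPy's general-purpose `lobpcg` as a stand-in, works on a dense real matrix, and leaves out the mixed precision orthogonalization and the two-stage workflow. The function name `smallest_eigs_mixed` and its parameters are hypothetical.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import LinearOperator, lobpcg


def smallest_eigs_mixed(A, k=4, tol=1e-8, maxiter=200, seed=0):
    """Approximate the k smallest eigenpairs of a dense SPD float64 matrix A.

    Illustrative assumption: the preconditioner is a Cholesky factorization
    of A computed once in float32; everything else runs in float64.
    """
    n = A.shape[0]
    # Reduced-precision step: factorize a float32 copy of A.
    factor32 = cho_factor(A.astype(np.float32))

    def apply_preconditioner(R):
        # Solve A Z = R with the float32 factor, then promote to float64.
        Z = cho_solve(factor32, np.asarray(R, dtype=np.float32))
        return Z.astype(np.float64)

    M = LinearOperator((n, n), matvec=apply_preconditioner,
                       matmat=apply_preconditioner, dtype=np.float64)
    # Random initial block of k vectors (float64).
    X0 = np.random.default_rng(seed).standard_normal((n, k))
    # largest=False requests the smallest eigenvalues, in ascending order.
    return lobpcg(A, X0, M=M, tol=tol, maxiter=maxiter, largest=False)
```

Because the float32 factor only enters through the preconditioner, its rounding errors affect how quickly the iteration converges, not the accuracy of the residuals, which are still evaluated with the float64 matrix.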

Paper Structure (14 sections, 3 theorems, 39 equations, 6 figures, 3 tables, 3 algorithms)

Key Result

Lemma 1

Let $\hat{r}_i$ denote the result of evaluating $r_i:=Ax_i - \rho(x_i)x_i$ in working precision. Assuming that (eq:er-Ax-ge) holds, there exist a symmetric matrix $F\in\mathbb{R}^{n\times n}$ and a diagonal matrix $E\in\mathbb{R}^{n\times n}$ such that where $\lVert E\rVert_2\leq \bm u_h$ and $\lVert F\rVert_2\leq \epsilon_r\lVert A\rVert_2$ with

Figures (6)

  • Figure 1: Tests for real sparse matrices on CPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
  • Figure 2: Tests for real sparse matrices on GPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
  • Figure 3: Tests for real dense kernel matrices on CPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
  • Figure 4: Tests for real dense kernel matrices on GPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
  • Figure 5: Tests for complex dense kernel matrices on GPU. For each matrix, the three columns from left to right represent the result of DLOBPCG-dchol, DLOBPCG-schol, and MPLOBPCG-schol, respectively.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Lemma 1
  • proof
  • Theorem 2
  • proof
  • Remark
  • Lemma 3
  • proof