Pipelined Dense Symmetric Eigenvalue Decomposition on Multi-GPU Architectures

Hansheng Wang, Ruiyi Zhan, Dajun Huang, Xingchen Liu, Qiao Li, Hancong Duan, Dingwen Tao, Guangming Tan, Shaoshuai Zhang

TL;DR

This work tackles GPU underutilization in large dense symmetric eigenvalue decomposition (EVD) on multi-GPU systems by introducing a pipelined two-stage EVD that overlaps successive band reduction (SBR), bulge chasing (BC), and back transformation. It replaces the traditional block-cyclic distribution with a blockwise scheme to enable the pipeline, reorders stages to decouple the divide-and-conquer (D&C) step from the back transformations, and optimizes SBR, BC, and BC-Back (including a BLAS2-based BC-Back) to minimize communication and synchronization. The result is substantially improved performance and scalability, with mean speedups of up to 9.24× over strong baselines on modern GPUs and robust weak scaling across 1–8 GPUs. These gains enable efficient solution of very large eigenvalue problems in physics and chemistry, and the approach lays the groundwork for larger future EVD tasks and non-symmetric extensions.
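The contrast between block-cyclic and blockwise distribution mentioned above can be illustrated with a minimal sketch. It assumes a one-dimensional distribution over block columns; the function names and the 1D layout are hypothetical and chosen only for illustration, and the paper's actual layout may differ.

```python
def block_cyclic_owner(block_idx: int, num_gpus: int) -> int:
    """Block-cyclic: block columns are dealt out round-robin across GPUs."""
    return block_idx % num_gpus


def blockwise_owner(block_idx: int, num_blocks: int, num_gpus: int) -> int:
    """Blockwise: each GPU owns one contiguous run of block columns."""
    blocks_per_gpu = -(-num_blocks // num_gpus)  # ceiling division
    return block_idx // blocks_per_gpu


# Hypothetical example: 8 block columns spread over 4 GPUs.
num_blocks, num_gpus = 8, 4
cyclic = [block_cyclic_owner(b, num_gpus) for b in range(num_blocks)]
blockwise = [blockwise_owner(b, num_blocks, num_gpus) for b in range(num_blocks)]

print("block-cyclic:", cyclic)     # [0, 1, 2, 3, 0, 1, 2, 3]
print("blockwise:   ", blockwise)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Under the blockwise mapping each GPU holds a single contiguous slab of the matrix, whereas under the block-cyclic mapping every GPU holds block columns interleaved with those of all the others.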

Abstract

Large symmetric eigenvalue problems arise in many disciplines, such as chemistry and physics, and several libraries, including cuSOLVERMp, MAGMA, and ELPA, support large eigenvalue decompositions on multi-GPU or hybrid multi-CPU-GPU architectures. However, these libraries do not deliver satisfactory performance: all of them utilize only around 1.5% of the peak multi-GPU performance. In this paper, we propose a pipelined two-stage eigenvalue decomposition algorithm, with substantial optimizations, in place of the conventional sequential algorithm. On an 8$\times$A100 platform, our implementation surpasses the state-of-the-art cuSOLVERMp and MAGMA baselines, delivering mean speedups of 5.74$\times$ and 6.59$\times$, respectively, with better strong and weak scalability.

Paper Structure

This paper contains 22 sections, 11 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: The first two iterations in SBR DBBR.
  • Figure 2: The illustration of BC-Back transformation multiplies $Q_d$.
  • Figure 3: The illustration of BC-Back transformation multiplies $Q_d$.
  • Figure 4: The timeline of MAGMA two-stage EVD routine with matrix size $49152\times 49152$ on 4 A100 GPUs.
  • Figure 5: The difference between the block cyclic distribution and the blockwise distribution.
  • ...and 11 more figures