Solvers for the Hermitian and the pseudo-Hermitian Bethe-Salpeter equation in the Yambo code: Implementation and Performance
Petru Milev, Blanca Mellado-Pinto, Muralidhar Nalabothula, Ali Esquembre Kucukalic, Fernando Alvarruiz, Enrique Ramos, Francesco Filippone, Alejandro Molina-Sanchez, Ludger Wirtz, Jose E. Roman, Davide Sangalli
TL;DR
This work treats the Bethe–Salpeter equation (BSE) as a structured eigenproblem and evaluates two solver paradigms implemented in the Yambo code: exact diagonalization, through the ScaLAPACK and ELPA libraries, and iterative methods, through the SLEPc library. It exploits the $\nabla$-pseudo-Hermitian structure via the $\Omega$ operator to transform the coupling case into forms that can be solved efficiently, which yields substantial speedups and memory savings. The study provides detailed CPU and GPU performance analyses for matrices up to $N \approx 10^5$, showing that pseudo-Hermitian solvers can make the coupling case nearly as efficient as the resonant case and that library-based solvers can overcome the solver bottleneck for large BSE matrices. In practice, this enables large-scale optical property calculations in condensed matter systems, and the work gives concrete guidance on when to prefer direct diagonalization, iterative methods, and pseudo-Hermitian-aware implementations. It also outlines integration strategies with multiple HPC libraries and points to possible future MAGMA and cuSOLVER integrations.
Abstract
We analyze the performance of two strategies for solving the structured eigenvalue problem deriving from the Bethe-Salpeter equation (BSE) in condensed matter physics. The BSE matrix is constructed with the Yambo code, and the two strategies are implemented by interfacing Yambo with the ScaLAPACK and ELPA libraries for direct diagonalization, and with the SLEPc library for the iterative approach. We consider both the Hermitian (Tamm-Dancoff approximation) and pseudo-Hermitian forms, addressing dense matrices of three different sizes. A description of the implementation is also provided, with details for the pseudo-Hermitian case. Timing and memory utilization are analyzed on both CPU and GPU clusters. Our results demonstrate that it is now feasible to handle dense BSE matrices with dimension of the order of $10^5$.
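
As a small illustration of the two forms mentioned above, the following sketch builds a toy matrix with the block structure commonly used for the BSE in the literature, $H = \begin{pmatrix} R & C \\ -C^* & -R^* \end{pmatrix}$ with $R$ Hermitian and $C$ complex symmetric, and checks that $\Omega H$ is Hermitian for $\Omega = \mathrm{diag}(1, -1)$. The block form, the choice of $\Omega$, the random test matrices, and all variable names are assumptions made for illustration. They are not taken from the Yambo implementation, and the dense NumPy calls merely stand in for the ScaLAPACK, ELPA, and SLEPc solvers.

```python
import numpy as np

n = 4  # toy block size; the paper targets dimensions up to about 1e5
rng = np.random.default_rng(0)

# Resonant block R: Hermitian, shifted so that the toy problem is well behaved.
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
R = (a + a.conj().T) / 2 + 4 * n * np.eye(n)

# Coupling block C: complex symmetric (C == C.T), kept small relative to R.
b = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C = 0.1 * (b + b.T) / 2

# Full (coupling) matrix in the assumed block form; it is not Hermitian.
H = np.block([[R, C], [-C.conj(), -R.conj()]])

# Assumed metric: Omega = diag(+1, ..., +1, -1, ..., -1).
Omega = np.diag(np.concatenate([np.ones(n), -np.ones(n)]))

# Pseudo-Hermiticity: Omega @ H equals its own conjugate transpose.
OH = Omega @ H
is_pseudo_hermitian = np.allclose(OH, OH.conj().T)

# General (non-Hermitian) solver for the coupling case.
evals = np.linalg.eigvals(H)

# Tamm-Dancoff approximation: drop C and solve the Hermitian block R alone.
tda_evals = np.linalg.eigvalsh(R)

print("Omega @ H Hermitian:", is_pseudo_hermitian)
print("max |Im(eigenvalue)|:", np.abs(evals.imag).max())
```

In this toy case $\Omega H$ is also positive definite, so the eigenvalues of $H$ are real and occur in $\pm$ pairs. This is the kind of structure that a pseudo-Hermitian-aware solver can exploit instead of treating $H$ as a general non-Hermitian matrix.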
