Generating coupled cluster code for modern distributed memory tensor software
Jan Brandejs, Johann Pototschnig, Trond Saue
TL;DR
This work addresses the challenge of delivering scalable coupled-cluster (CC) computations on modern GPU-based HPC platforms by introducing tenpi, an open-source code generator that translates high-level CC representations into optimized, distributed-memory code. It combines Kucharski–Bartlett diagram-based derivation, automatic intermediate optimization, and backend integration with tensor libraries like ExaTENSOR to produce Fortran 2008 code suitable for GPU clusters, achieving strong and weak scaling up to 1200 GPUs. The authors validate correctness against established CC implementations for small molecules and demonstrate scalable performance on UF$_6$ and CO, highlighting the practicality of automated CC code generation in heterogeneous environments. The results establish tenpi as a flexible, production-oriented pathway for higher-order CC methods, while outlining future work on symmetry handling, CCSD(T) extensions, and cost-model refinement to further boost efficiency and applicability.
Abstract
Using GPU-based HPC platforms efficiently for coupled cluster computations is a challenge due to heterogeneous hardware structures. The constant need to adapt software to these structures and the required man-hours make a systematization of high-performance code development desirable, even more so for higher-order coupled cluster. This is generally achieved by introducing a high-level representation of the problem, which is then translated to low-level instructions for the hardware using a compiler/translator component. Designing such software comes with another challenge: allowing efficient implementation by capturing key symmetries of tensors, while retaining the abstraction from the hardware. We review ways to address these two challenges while presenting the design decisions that led us to the development of a general-order coupled cluster code generator. The systematically produced code shows excellent weak scaling behavior running on up to 1200 GPUs using the distributed memory tensor library ExaTENSOR. We present an open-source modular tensor framework "tenpi" for coupled cluster code development with diagrammatic derivation, a visualization module, symbolic algebra, intermediate optimization, and support for multiple tensor backends. Tenpi brings higher-order CC functionality to the massively parallel ExaCorr module of the DIRAC code for relativistic molecular calculations.
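To illustrate what "intermediate optimization" can mean in this context, the following is a minimal sketch and not tenpi's actual algorithm or API. It takes one quadratic term of the kind that appears in coupled cluster doubles equations, R[abij] += Σ W[klcd] T[cdij] T[abkl], and compares the multiply-add count of a single-step evaluation with pairwise evaluation through an intermediate. The function name `pairwise_cost`, the index-string encoding, the cost model (product of all index ranges touched in a step), and the sizes `n_occ = 10`, `n_virt = 40` are assumptions made for this example only.

```python
from itertools import permutations
from math import prod

# Hypothetical sizes: occupied indices i,j,k,l and virtual indices a,b,c,d.
n_occ, n_virt = 10, 40
dims = {**{x: n_occ for x in "ijkl"}, **{x: n_virt for x in "abcd"}}

# One term: R[abij] += sum_{klcd} W[klcd] * T[cdij] * T[abkl]
tensors = ("klcd", "cdij", "abkl")  # index labels of W, T, T
output = "abij"                     # index labels of the result R


def pairwise_cost(order, output, dims):
    """Multiply-add count when tensors are contracted pairwise, left to right.

    After each step, only indices still needed by the output or by a tensor
    not yet contracted are kept; all others are summed over immediately.
    """
    total = 0
    current = set(order[0])
    for pos, nxt in enumerate(order[1:], start=1):
        touched = current | set(nxt)
        total += prod(dims[x] for x in touched)
        needed = set(output).union(*order[pos + 1:])
        current = touched & needed  # index set of the new intermediate
    return total


# Single-step evaluation loops over every distinct index at once.
direct_cost = prod(dims[x] for x in set("".join(tensors)))

# Exhaustive search over contraction orders (fine for three tensors).
best_order = min(permutations(tensors), key=lambda o: pairwise_cost(o, output, dims))
best_cost = pairwise_cost(best_order, output, dims)

print("direct:", direct_cost)
print("best order:", best_order, "cost:", best_cost)
```

With these sizes, the cheapest order first forms the intermediate I[klij] = Σ W[klcd] T[cdij] and then contracts it with T[abkl]. Each step scales as o⁴v², compared with o⁴v⁴ for the single-step evaluation. A production generator would also have to weigh memory use, tensor symmetries, and the reuse of intermediates across terms, none of which this sketch models.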
