Generating coupled cluster code for modern distributed memory tensor software

Jan Brandejs, Johann Pototschnig, Trond Saue

TL;DR

This work addresses the challenge of delivering scalable coupled-cluster (CC) computations on modern GPU-based HPC platforms by introducing tenpi, an open-source code generator that translates high-level CC representations into optimized, distributed-memory code. It combines Kucharski–Bartlett diagram-based derivation, automatic intermediate optimization, and backend integration with tensor libraries like ExaTENSOR to produce Fortran 2008 code suitable for GPU clusters, achieving strong and weak scaling up to 1200 GPUs. The authors validate correctness against established CC implementations for small molecules and demonstrate scalable performance on UF$_6$ and CO, highlighting the practicality of automated CC code generation in heterogeneous environments. The results establish tenpi as a flexible, production-oriented pathway for higher-order CC methods, while outlining future work on symmetry handling, CCSD(T) extensions, and cost-model refinement to further boost efficiency and applicability.

Abstract

Using GPU-based HPC platforms efficiently for coupled cluster computations is a challenge due to heterogeneous hardware structures. The constant need to adapt software to these structures, and the man-hours this requires, make a systematization of high-performance code development desirable, even more so for higher-order coupled cluster. This is generally achieved by introducing a high-level representation of the problem, which is then translated to low-level instructions for the hardware using a compiler/translator component. Designing such software comes with another challenge: allowing efficient implementation by capturing key symmetries of tensors, while retaining the abstraction from the hardware. We review ways to address these two challenges while presenting the design decisions that led us to the development of a general-order coupled cluster code generator. The systematically produced code shows excellent weak scaling behavior running on up to 1200 GPUs using the distributed memory tensor library ExaTENSOR. We present an open-source modular tensor framework "tenpi" for coupled cluster code development with diagrammatic derivation, a visualization module, symbolic algebra, intermediate optimization and support for multiple tensor backends. Tenpi brings higher-order CC functionality to the massively parallel ExaCorr module of the DIRAC code for relativistic molecular calculations.

Paper Structure (17 sections, 5 equations, 11 figures, 4 tables)

This paper contains 17 sections, 5 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Summit node structure. This figure is without copyright and is used with the explicit consent of OLCF (Ref. SummitManual).
  • Figure 2: An example of the string representation of CC diagrams. The three consecutive zeros leave space for one further admissible $\hat{T}$ operator. Note that the triplets of integers corresponding to $\hat{T}$ operators are ordered to ensure uniqueness of the diagrams. See Ref. Kallay2001 for a full explanation. Left: Visual representation of a diagram and an equation, shown as printed from tenpi.
  • Figure 3: A simple algorithm to generate diagram strings as shown in Fig. \ref{fig:sequence} for the CC Eqs. \ref{eq:amp}. Please refer to the original Ref. Kallay2001 for a detailed description. This algorithm has been extended in tenpi to support matrix elements with any bra and ket excitation levels, arbitrary interaction, excitation and de-excitation operators or $\exp(\hat{T})$.
  • Figure 4: The workflow of tenpi. First, diagrams are generated. These are then translated to a line chart representation. The CC interpretation rules are applied and both diagrams and equation terms are printed in a textbook-like PDF format. The permutations are applied and the resulting code is optimized using OpMin. The produced intermediates are reoptimized to decrease memory cost using the algorithm in Fig. \ref{algo:remove}. The correctness of the intermediates is tested using a generated, simple brute-force Python script. Equations are printed in a readable format in each of these steps. The complete source code files are generated as required (ExaTENSOR Fortran, NumPy Python, etc.). The procedure can be fully customized, as all these steps are calls to the high-level interface of the tenpi Python library (Ref. tenpi).
  • Figure 5: High-level representations of the problem in the tenpi implementation. An example is shown in blue at the bottom right of most representations. Solid arrows between representations show implemented conversions, which typically follow the workflow of the program. Dashed arrows indicate the four possible entry points for user input. When tenpi is used as a library or in interactive mode, a broad class of customizations becomes available, e.g. adding energy denominators to selected terms or contracting the open lines of two diagrams together to get a scalar. Output code is in the bottom row. The second blue line under matrix elements shows how the example matrix element is actually input. The superscript in the list of lines under the Diagram class numbers the nodes of an operator. The list of lines represents a directed graph corresponding to the diagram. The Contraction class is a central representation in the code. It features a range of processing tools for symbolic manipulations on a set of equations, such as substitution of terms, their products, contraction of terms, index permutations and detection of when a tensor can be deallocated. Brute-force Python test loops serve to check numerically the equivalence of two sets of equations (see section \ref{implementations}).
  • ...and 6 more figures
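
The captions of Figures 4 and 5 mention brute-force Python test loops that check numerically whether two sets of equations are equivalent, for example before and after intermediates are introduced. The sketch below illustrates that idea only; it is not code generated by tenpi. The index pattern, the tensor names `v` and `t`, and all function names are hypothetical and chosen purely for illustration. It evaluates one contraction in a single loop nest, evaluates it again through an intermediate, and compares the two results on random input.

```python
import random


def random_tensor(shape, rng):
    """Build nested lists of uniform random floats with the given shape."""
    if len(shape) == 1:
        return [rng.uniform(-1.0, 1.0) for _ in range(shape[0])]
    return [random_tensor(shape[1:], rng) for _ in range(shape[0])]


def direct_term(v, t, n_occ, n_virt):
    """R[a][i] = sum_{k,c,d} v[k][a][c][d] * t[c][i] * t[d][k], in one loop nest."""
    r = [[0.0] * n_occ for _ in range(n_virt)]
    for a in range(n_virt):
        for i in range(n_occ):
            for k in range(n_occ):
                for c in range(n_virt):
                    for d in range(n_virt):
                        r[a][i] += v[k][a][c][d] * t[c][i] * t[d][k]
    return r


def factorized_term(v, t, n_occ, n_virt):
    """Same term, evaluated through the intermediate X[a][c] = sum_{k,d} v[k][a][c][d] * t[d][k]."""
    x = [[0.0] * n_virt for _ in range(n_virt)]
    for a in range(n_virt):
        for c in range(n_virt):
            for k in range(n_occ):
                for d in range(n_virt):
                    x[a][c] += v[k][a][c][d] * t[d][k]
    r = [[0.0] * n_occ for _ in range(n_virt)]
    for a in range(n_virt):
        for i in range(n_occ):
            for c in range(n_virt):
                r[a][i] += x[a][c] * t[c][i]
    return r


def max_abs_diff(r1, r2):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for row1, row2 in zip(r1, r2) for p, q in zip(row1, row2))


def equations_agree(n_occ=2, n_virt=3, seed=0, tol=1e-10):
    """Return True if the direct and factorized forms agree on random input."""
    rng = random.Random(seed)
    v = random_tensor((n_occ, n_virt, n_virt, n_virt), rng)
    t = random_tensor((n_virt, n_occ), rng)
    diff = max_abs_diff(direct_term(v, t, n_occ, n_virt),
                        factorized_term(v, t, n_occ, n_virt))
    return diff < tol


if __name__ == "__main__":
    print("factorization consistent:", equations_agree())
```

Random input makes an accidental agreement between two inequivalent expressions very unlikely, and small index ranges keep the deliberately unoptimized loops cheap. A check of this kind trades speed for transparency, since the loop nest mirrors the written equation term by term.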