Efficient Coupled-Cluster Python Frameworks for Next-Generation GPUs: A Comparative Study of CuPy and PyTorch on the Hopper and Grace Hopper Architecture

Antonina Dobrowolska, Julian Świerczyński, Paweł Tecmer, Emil Sujkowski, Somayeh Ahmadkhani, Grzegorz Mazur, Klemens Noga, Jeff Hammond, Katharina Boguslawski

Abstract

In this work, we introduce new batching algorithms that handle the large tensor contractions encountered in Python implementations of coupled-cluster singles and doubles (CCSD) within the limited Video Random Access Memory (VRAM) of graphics processing units (GPUs), thereby improving performance. Specifically, we benchmark the CuPy and PyTorch libraries on a single NVIDIA Hopper (H100) GPU and on the Grace Hopper (GH200) architecture. We begin by optimizing the particle-particle ladder contraction, the bottleneck in CCSD, using an asymmetric and dynamic splitting recipe, and then move toward a generic tensor contraction protocol that allows tensor contractions to be performed almost exclusively on GPUs. We benchmark our new, fully generic GPU-accelerated coupled-cluster implementations for various molecular systems and basis-set sizes, using both the CuPy and PyTorch libraries. While PyTorch outperforms CuPy on H100 by approximately 20%, both perform similarly on the GH200 architecture. Compared to our initial GPU implementation [J. Chem. Theory Comput. 2024, 20, 3, 1130–1142], we achieve a 10-fold speedup. In molecular CCSD calculations using Cholesky-decomposed electron repulsion integrals, we report additional speedups by factors of 3 to 16 for a single CCSD iteration compared to our original GPU-CPU hybrid implementation.

Paper Structure (12 sections, 6 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Schematic batching procedure required to process large tensors when going from the CPU to the GPU and back.
  • Figure 2: Schematic illustration of the X-split protocol, where the number of batches is determined along the 'a', 'b', and 'e' axes. 'n_a', 'n_b', and 'n_e' indicate the number of batches along the axes 'a', 'b', and 'e', respectively. mem(x-split) is the memory compute function that determines the VRAM required for the selected tensor contraction operation; it is defined in Table \ref{tbl:memory} and indicated as $\sum n_i \cdot \mathrm{size}(i)$ in the decision symbol. The splitting process for the Cholesky vectors has been omitted to facilitate a direct comparison with the C-split algorithm.
  • Figure 3: Schematic illustration of the C-split protocol, where the number of batches is determined along the 'a', 'b', and 'c' axes. 'n_a', 'n_b', and 'n_c' indicate the number of batches along the axes 'a', 'b', and 'c', respectively. mem[step0] and mem[step1] are the memory compute functions that determine the VRAM required for the selected tensor contraction operation, while free VRAM denotes the accessible VRAM for the current tensor contraction operation. All memory functions are defined in Table \ref{tbl:memory}.
  • Figure 4: Schematic illustration of the generic batching recipe. The tensor contraction is sequenced into optimal pair contractions (the "find optimal path" decision process), where only the first contraction step is batched according to the notation op0,op1->out. 'n_0' and 'n_1' indicate the number of batches along the automatically selected axes of 'op0' and 'op1', respectively. These axes are not summed over and appear in both the input and output arrays, which ensures that the batching carries over to subsequent steps of the optimal contraction path. mem(generic) is the memory compute function that determines the VRAM required for the selected tensor contraction operation (labeled as "ref." in the decision process), while free VRAM denotes the accessible VRAM for the current tensor contraction operation. All memory functions are defined in Table \ref{tbl:memory}.
  • Figure 5: Schematic representation of the logic flow of the designed tensor contraction engine in PyBEST.
  • ...and 2 more figures
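
The batching idea described in the captions of Figures 1 and 4 can be illustrated with a minimal sketch. It is not the PyBEST implementation: the function name `batched_pair_contraction`, the axis-selection rule, and the memory estimate are assumptions made for illustration only, and NumPy stands in for CuPy or PyTorch so that the example runs without a GPU (host-to-device transfers are omitted). The sketch splits one non-contracted axis of `op0` that also appears in the output, and increases the number of batches until a simple memory estimate fits into a given budget.

```python
import numpy as np


def batched_pair_contraction(subscripts, op0, op1, free_bytes):
    """Evaluate a pair contraction 'op0,op1->out' in batches (illustrative only).

    The batch axis is the first label of op0 that appears in the output
    and not in op1, so it is never summed over.
    """
    inputs, out = subscripts.split("->")
    sub0, sub1 = inputs.split(",")
    label = next(l for l in sub0 if l in out and l not in sub1)
    ax0, ax_out = sub0.index(label), out.index(label)

    dims = {**dict(zip(sub0, op0.shape)), **dict(zip(sub1, op1.shape))}
    result = np.empty(tuple(dims[l] for l in out),
                      dtype=np.result_type(op0, op1))
    n = dims[label]

    def mem(n_batches):
        # Assumed estimate: all of op1 plus one slice of op0 and of the output.
        chunk = -(-n // n_batches)  # ceiling division
        per_index = op0.size // n + result.size // n
        return (op1.size + per_index * chunk) * result.dtype.itemsize

    # Increase the number of batches until the estimate fits the budget.
    n_batches = 1
    while mem(n_batches) > free_bytes:
        if n_batches >= n:
            raise MemoryError("contraction does not fit even with one index per batch")
        n_batches += 1

    for idx in np.array_split(np.arange(n), n_batches):
        sl0 = [slice(None)] * op0.ndim
        slo = [slice(None)] * result.ndim
        sl0[ax0] = slice(idx[0], idx[-1] + 1)
        slo[ax_out] = slice(idx[0], idx[-1] + 1)
        # On a GPU, this is where a slice would be copied to the device,
        # contracted there, and the partial result copied back.
        result[tuple(slo)] = np.einsum(subscripts, op0[tuple(sl0)], op1)
    return result, n_batches


rng = np.random.default_rng(0)
x = rng.random((6, 5, 4, 3))   # labels 'abcd'
y = rng.random((4, 3, 2, 7))   # labels 'cdef'
z, n_batches = batched_pair_contraction("abcd,cdef->abef", x, y, free_bytes=4000)
```

With the artificially small 4000-byte budget used here, the contraction is split along the 'a' axis, and the batched result agrees with a single `np.einsum` call. In a GPU setting, the budget would instead come from the free VRAM reported by the device.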