Fast Algorithms for Scheduling Many-body Correlation Functions on Accelerators

Oguz Selvitopi, Emin Ozturk, Jie Chen, Ponnuswamy Sadayappan, Robert G. Edwards, Aydın Buluç

TL;DR

The paper tackles the memory bottleneck in LQCD hadronic correlation-function computations, which require thousands of large tensor contractions. It introduces two scheduling strategies that operate on a contraction DAG to maximize data reuse and minimize peak memory: a sibling scheduler that quickly orders binary contractions, and a tree scheduler that performs global, memory-aware ordering of contraction trees. Integrated into the Redstar software suite, these schedulers reduce peak memory by up to $2.1\times$, lower evictions by up to $4.2\times$, cut data traffic by up to $1.8\times$, and achieve up to $1.9\times$ faster time-to-solution on GPU-accelerated runs. The reduced memory traffic and improved accelerator performance make larger-scale LQCD analyses practical.

Abstract

Computation of correlation functions is a key operation in lattice quantum chromodynamics (LQCD) simulations to extract nuclear physics observables. These functions involve many binary batch tensor contractions, each tensor possibly occupying hundreds of MBs of memory. Performing these contractions on GPU accelerators poses the challenge of scheduling them so as to optimize tensor reuse and reduce data traffic. In this work we propose two fast novel scheduling algorithms that reorder contractions to increase temporal locality via input/intermediate tensor reuse. Our schedulers take advantage of application-specific features, such as contractions being binary and locality within contraction trees, to optimize the objective of minimizing peak memory. We integrate them into the LQCD analysis software suite Redstar and improve time-to-solution. Our schedulers attain up to 2.1x improvement in peak memory, which is reflected by a reduction of up to 4.2x in evictions and up to 1.8x in data traffic, resulting in up to 1.9x faster correlation function computation.

Paper Structure

This paper contains 15 sections, 7 figures, 4 tables, and 8 algorithms.

Figures (7)

  • Figure 1: The contraction DAG formed from three contraction trees. Nodes that appear in multiple trees are shown in green. Edges that appear in multiple trees are shown with dashed arrows.
  • Figure 2: At each iteration, the tree scheduler selects a contraction tree (shown on the right) and processes the nodes in that tree. It then either keeps tensors in memory if they are needed by other contraction trees (shown in the middle) or releases them from memory if they are not needed by any other contractions (shown on the left).
  • Figure 3: Individual gains of nodes in trees after initialization, indicated with the numbers near the nodes. The example assumes the nodes have unit sizes.
  • Figure 4: Various cases that may trigger a gain update regarding node $x$ whose parent $u$ is being processed.
  • Figure 5: Structures of the contraction DAGs.
  • ...and 2 more figures
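
The keep-or-release behavior described in the Figure 2 caption can be illustrated with a small simulation. The sketch below is not the paper's tree scheduler: it simply replays a given tree order, computes shared contractions once, releases a tensor as soon as no remaining contraction consumes it, and reports peak memory. The function name `peak_memory`, the `(out, left, right)` tuple encoding, the tensor names, and the unit tensor sizes (borrowed from the Figure 3 assumption) are all hypothetical choices made for illustration, and the brute-force search over orders is only feasible for toy inputs.

```python
from itertools import permutations


def peak_memory(trees, order, size=lambda tensor: 1):
    """Replay `trees` in `order` and return the peak resident tensor size.

    Each tree is a list of (out, left, right) contractions in post-order.
    A contraction that appears in several trees is computed only once.
    """
    # Number of distinct contractions that consume each tensor.
    uses = {}
    for contraction in {c for tree in trees for c in tree}:
        for operand in contraction[1:]:
            uses[operand] = uses.get(operand, 0) + 1

    resident, done = set(), set()
    current = peak = 0
    for index in order:
        for contraction in trees[index]:
            if contraction in done:
                continue  # shared node, already computed by an earlier tree
            done.add(contraction)
            out, left, right = contraction
            # Operands and the output must all be resident during the contraction.
            for tensor in (left, right, out):
                if tensor not in resident:
                    resident.add(tensor)
                    current += size(tensor)
            peak = max(peak, current)
            for operand in (left, right):
                uses[operand] -= 1
            # Release tensors that no remaining contraction needs
            # (this includes root outputs, which have no consumers).
            for tensor in (left, right, out):
                if uses.get(tensor, 0) == 0 and tensor in resident:
                    resident.remove(tensor)
                    current -= size(tensor)
    return peak


# Three toy trees: trees 0 and 2 share the contraction producing "ab",
# and trees 1 and 2 share the input tensor "d".
trees = [
    [("ab", "a", "b"), ("A", "ab", "c")],
    [("de", "d", "e"), ("B", "de", "f")],
    [("ab", "a", "b"), ("C", "ab", "d")],
]

# Exhaustive search over tree orders; a stand-in for a real scheduler.
best_order = min(permutations(range(len(trees))),
                 key=lambda order: peak_memory(trees, order))
print(best_order, peak_memory(trees, best_order))
```

In this toy input, running the two trees that share "ab" back to back lets the shared tensor be released early, so the order 0, 2, 1 peaks at 3 unit-sized tensors while the order 0, 1, 2 peaks at 5.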