Fast Algorithms for Scheduling Many-body Correlation Functions on Accelerators
Oguz Selvitopi, Emin Ozturk, Jie Chen, Ponnuswamy Sadayappan, Robert G. Edwards, Aydın Buluç
TL;DR
The paper tackles the memory bottleneck in LQCD hadronic correlation-function computations, which require thousands of large tensor contractions. It introduces two scheduling strategies that operate on a contraction DAG to maximize data reuse and minimize peak memory: a sibling scheduler, a fast scheme built around binary contractions, and a tree scheduler, which performs global, memory-aware ordering. Integrated into the Redstar software suite, these schedulers reduce peak memory by up to $2.1\times$, lower evictions by up to $4.2\times$, cut data traffic by up to $1.8\times$, and achieve up to $1.9\times$ faster time-to-solution on GPU-accelerated runs. By reducing memory traffic, the schedulers enable larger-scale LQCD analyses on accelerators.
Abstract
Computation of correlation functions is a key operation in lattice quantum chromodynamics (LQCD) simulations to extract nuclear physics observables. These functions involve many binary batch tensor contractions, with each tensor possibly occupying hundreds of MBs of memory. Performing these contractions on GPU accelerators poses the challenge of scheduling them so as to optimize tensor reuse and reduce data traffic. In this work we propose two fast, novel scheduling algorithms that reorder contractions to increase temporal locality via input/intermediate tensor reuse. Our schedulers take advantage of application-specific features, such as contractions being binary and locality within contraction trees, to optimize the objective of minimizing peak memory. We integrate them into the LQCD analysis software suite Redstar and improve time-to-solution. Our schedulers attain up to 2.1x improvement in peak memory, which is reflected in reductions of up to 4.2x in evictions and up to 1.8x in data traffic, resulting in up to 1.9x faster correlation function computation.
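To illustrate why the order of contractions affects peak memory, the following is a minimal Python sketch of a toy model. It is not either of the schedulers proposed in this work. The tensor names, the 100 MB sizes, the helper `peak_memory`, and the policy of freeing every tensor right after its last use are all assumptions made for illustration. The sketch evaluates two orderings of the same four binary contractions, taken from two independent contraction trees, and reports the peak resident memory of each.

```python
from typing import Dict, List, Tuple

def peak_memory(order: List[str],
                contractions: Dict[str, Tuple[str, str]],
                size: Dict[str, int]) -> int:
    """Peak resident memory when contractions run in `order`.

    Toy policy (an assumption): a tensor becomes resident at its first use
    and is freed immediately after the last contraction that reads it.
    Outputs that no later contraction reads are freed right away.
    """
    # Record the last step at which each tensor is read as an operand.
    last_use: Dict[str, int] = {}
    for step, out in enumerate(order):
        for operand in contractions[out]:
            last_use[operand] = step

    resident = set()
    peak = 0
    for step, out in enumerate(order):
        left, right = contractions[out]
        # Both inputs and the output must be resident during the contraction.
        resident.update((left, right, out))
        peak = max(peak, sum(size[t] for t in resident))
        # Free every tensor that has no use after this step.
        for t in list(resident):
            if last_use.get(t, -1) <= step:
                resident.discard(t)
    return peak

# Hypothetical DAG with two independent trees: (A*B)*C and (D*E)*F.
contractions = {
    "X1": ("A", "B"), "R1": ("X1", "C"),
    "Y1": ("D", "E"), "R2": ("Y1", "F"),
}
size = {name: 100 for name in "A B C D E F X1 R1 Y1 R2".split()}  # MB each

tree_by_tree = ["X1", "R1", "Y1", "R2"]  # finish one tree before the next
interleaved = ["X1", "Y1", "R1", "R2"]   # all leaves first, then the roots

peak_tree = peak_memory(tree_by_tree, contractions, size)
peak_interleaved = peak_memory(interleaved, contractions, size)
print(peak_tree, peak_interleaved)
```

In this toy model, finishing one tree before starting the next peaks at 300 MB, while the interleaved order peaks at 400 MB because the intermediate `X1` must stay resident while `Y1` is being computed. A real scheduler must also account for tensors shared between trees and for evictions under a fixed memory budget, which this sketch ignores.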
