Matryoshka: Optimization of Dynamic Diverse Quantum Chemistry Systems via Elastic Parallelism Transformation

Tuowei Wang; Kun Li; Donglin Bai; Fusong Ju; Leo Xia; Ting Cao; Ju Ren; Yaoxue Zhang; Mao Yang

TL;DR

Quantum chemistry (QC) workloads exhibit dynamic diversity that undermines GPU efficiency. Matryoshka builds on Elastic Parallelism Transformation (EPT) and three core components (Block Constructor, Graph Compiler, and Workload Allocator) to elastically align QC operators with GPU architecture, enabling scalable, accurate ab initio simulations. The approach reduces the electron repulsion integral (ERI) data layout from $O(N^4)$ to $O(N^2)$, automates kernel generation offline, and auto-tunes workloads at runtime, delivering speedups of up to $13.86\times$ and supporting simulations with thousands of atoms on a single GPU. The work aims at a shift toward System4Science: large-scale, GPU-accelerated QC workflows with high utilization and accuracy.

Abstract

AI infrastructures, predominantly GPUs, have delivered remarkable performance gains for deep learning. Conversely, scientific computing, exemplified by quantum chemistry systems, suffers from dynamic diversity, where computational patterns are more diverse and vary dynamically, posing a significant challenge to extracting acceleration from GPUs. In this paper, we propose Matryoshka, a novel elastically parallel technique for the efficient execution of quantum chemistry systems with dynamic diversity on GPUs. Matryoshka capitalizes on Elastic Parallelism Transformation, a property prevalent in scientific systems yet underexplored for dynamic diversity, to elastically realign parallel patterns with GPU architecture. Structured around three transformation primitives (Permutation, Deconstruction, and Combination), Matryoshka encompasses three core components. The Block Constructor serves as the central orchestrator, which reformulates data structures to accommodate dynamic inputs and constructs fine-grained, GPU-efficient compute blocks. Within each compute block, the Graph Compiler operates offline, generating high-performance code with a clear computational path through an automated compilation process. The Workload Allocator dynamically schedules workloads with varying operational intensities to threads online. It achieves highly efficient parallelism for compute-intensive operations and automatically facilitates fusion with neighboring memory-intensive operations. Extensive evaluation shows that Matryoshka effectively addresses dynamic diversity, yielding speedups of up to $13.86\times$ (average $9.41\times$) over prevailing state-of-the-art approaches on 13 quantum chemistry systems.
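To make the $O(N^4)$ to $O(N^2)$ ERI layout reduction mentioned in the TL;DR more concrete, the following is a minimal Python sketch of the general pair-precomputation idea. It is not the authors' Block Constructor. It assumes unnormalized s-type primitive Gaussians and uses the standard closed-form $(ss|ss)$ integral. The names `build_pair_table`, `eri_ssss`, and `boys_f0` are hypothetical. Only per-pair records are stored, and each four-index integral is formed on the fly from two of them.

```python
import math
from itertools import combinations_with_replacement


def boys_f0(t):
    """Zeroth-order Boys function F0(t), with the t -> 0 limit handled."""
    if t < 1e-12:
        return 1.0
    return 0.5 * math.sqrt(math.pi / t) * math.erf(math.sqrt(t))


def build_pair_table(exponents, centers):
    """Store one record per unordered pair (i, j): O(N^2) entries.

    Each record holds the Gaussian-product exponent p, the product
    center P, and the pair prefactor K (illustrative layout only).
    """
    table = []
    for i, j in combinations_with_replacement(range(len(exponents)), 2):
        a, b = exponents[i], exponents[j]
        A, B = centers[i], centers[j]
        p = a + b
        P = tuple((a * x + b * y) / p for x, y in zip(A, B))
        K = math.exp(-a * b / p * math.dist(A, B) ** 2)
        table.append((p, P, K))
    return table


def eri_ssss(bra, ket):
    """(ab|cd) for unnormalized s primitives, built from two pair records."""
    p, P, k_ab = bra
    q, Q, k_cd = ket
    t = p * q / (p + q) * math.dist(P, Q) ** 2
    prefactor = 2.0 * math.pi ** 2.5 / (p * q * math.sqrt(p + q))
    return prefactor * k_ab * k_cd * boys_f0(t)


# Three hypothetical primitives: two at the origin, one displaced along z.
exponents = [1.0, 1.0, 0.5]
centers = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
table = build_pair_table(exponents, centers)

# Any quartet is addressed as a pair of pair records; no N^4 array is stored.
value = eri_ssss(table[0], table[4])
```

For $N$ primitives the table holds $N(N+1)/2$ records, whereas an explicit four-index array would need on the order of $N^4$ entries. How Matryoshka actually organizes its pair data and compute blocks is described in the paper itself and may differ substantially from this sketch.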
