Optimizing Data Distribution and Kernel Performance for Efficient Training of Chemistry Foundation Models: A Case Study with MACE

Jesun Firoz, Franco Pellegrini, Mario Geiger, Darren Hsu, Jenna A. Bilbrey, Han-Yi Chou, Maximilian Stadler, Markus Hoehnerbach, Tingyu Wang, Dejun Lin, Emine Kucukbenli, Henry W. Sprueill, Ilyes Batatia, Sotiris S. Xantheas, MalSoon Lee, Chris Mundy, Gabor Csanyi, Justin S. Smith, Ponnuswamy Sadayappan, Sutanay Choudhury

TL;DR

This work addresses the scalability challenges of chemistry foundation models (CFMs) that operate on many small 3D molecular graphs by (i) casting data batching as a multi-objective bin-packing problem for balanced GPU workloads and (ii) accelerating the dominant symmetric tensor contraction kernel via kernel fusion and sparsity-aware optimizations. The proposed iterative batching algorithm and kernel-level enhancements together yield roughly a 6× reduction in per-epoch training time (from 12 to 2 minutes) on 740 GPUs for a 2.6M-sample dataset, while maintaining comparable learning dynamics. The results demonstrate improved strong and weak scaling, verified across diverse chemical systems and hyperparameter settings, and provide practical guidelines for bin capacity and minibatch sizing. Overall, the approach advances efficient, scalable CFM training and offers techniques applicable to other equivariant GNN-based models in chemistry and materials science.

Abstract

Chemistry Foundation Models (CFMs) that leverage Graph Neural Networks (GNNs) operating on 3D molecular graph structures are becoming indispensable tools for computational chemists and materials scientists. These models facilitate the understanding of matter and the discovery of new molecules and materials. In contrast to GNNs operating on large homogeneous graphs, GNNs used by CFMs process a large number of geometric graphs of varying sizes, requiring different optimization strategies than those developed for large homogeneous GNNs. This paper presents optimizations for two critical phases of CFM training: data distribution and model training, targeting MACE, a state-of-the-art CFM. We address the challenge of load balancing in data distribution by formulating it as a multi-objective bin packing problem. We propose an iterative algorithm that provides a highly effective, fast, and practical solution, ensuring efficient data distribution. For the training phase, we identify symmetric tensor contraction as the key computational kernel in MACE and optimize this kernel to improve the overall performance. Our combined approach of balanced data distribution and kernel optimization significantly enhances the training process of MACE. Experimental results demonstrate a substantial speedup, reducing per-epoch execution time for training from 12 to 2 minutes on 740 GPUs with a 2.6M-sample dataset.
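To make the load-balancing idea concrete, the following is a minimal sketch of a generic greedy heuristic (largest graph first, placed into the least-loaded bin). It is not the paper's iterative multi-objective algorithm; it assumes a single objective in which the cost of a graph is simply its atom count, and the function names and the atom counts are hypothetical.

```python
import heapq


def balance_graphs(sizes, num_bins):
    """Assign graph indices to bins: largest graph first, least-loaded bin first.

    This is the classic longest-processing-time greedy heuristic, used here
    only as an illustrative stand-in for a real batching algorithm.
    """
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    # Min-heap of (current_load, bin_id); the least-loaded bin is popped first.
    heap = [(0, b) for b in range(num_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(num_bins)]
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for idx in order:
        load, b = heapq.heappop(heap)
        bins[b].append(idx)
        heapq.heappush(heap, (load + sizes[idx], b))
    return bins


def bin_loads(sizes, bins):
    """Total size (here: atom count) assigned to each bin."""
    return [sum(sizes[i] for i in members) for members in bins]


# Hypothetical atom counts for eight molecular graphs.
atom_counts = [96, 12, 48, 30, 64, 8, 20, 42]
bins = balance_graphs(atom_counts, num_bins=4)
loads = bin_loads(atom_counts, bins)
print("assignment:", bins)
print("per-bin atom totals:", loads)
```

In this toy setting each bin would correspond to one GPU's minibatch. A multi-objective formulation such as the one the paper describes would track more than one quantity per bin, which this sketch does not attempt.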

Paper Structure

This paper contains 33 sections, 7 equations, 14 figures, 3 tables, 3 algorithms.

Figures (14)

  • Figure 1: Key contrast between different classes of GNNs: Molecular GNNs are trained on many small graphs instead of a single big graph.
  • Figure 2: A schematic diagram of the MACE model (a-d) \cite{batatia2022mace}. In (a), input and output tensors are shown for each of the embedding, interaction, linear readout, and MLP readout layers. Two interaction layers are shown as this is the number typically used in MACE. The construction of $A_{i,kl_3m_3}$ is shown in the interaction layer (c) with details of the higher body-order products shown in (d). Algorithm \ref{alg:mace_contraction} shows the details for computing the product in the blue box of (d).
  • Figure 3: Minibatch creation and distribution process of molecular graphs for a GNN model training across 4 compute nodes.
  • Figure 4: A schematic diagram of the tensor contraction performed in Algorithm \ref{alg:channel_wise}. Conceptually, two equivariant tensors, $Y_{ji,l_1}^{m_1}$ and $h_{j,kl_2m_2}$, are multiplied via the normal tensor product to form the product tensor $Y_{ji,l_1}^{m_1}\otimes h_{l_2m_2}$, which is contracted using the Clebsch-Gordan coefficients to form the output equivariant tensor $A_{i,kl_3m_3}$. The additional channel dimension $k$ is mixed by the radial embedding $R_{ji,kl_1l_2l_3}$ (not shown). In practice the product is computed for every combination of $l_1,l_2,l_3$ up to $l_{\mathrm{max}}$. The green boxes depict the $l_1=2,l_2=3,l_3=2$ selection, and the Clebsch-Gordan coefficient matrix is shown for $l_1=2,l_2=3,l_3=2$ and $m_3=-1$, which is highly sparse.
  • Figure 5: Optimized Algorithm \ref{alg:mace_contraction}
  • ...and 9 more figures
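
The sparsity noted in the Figure 4 caption can be illustrated with a much smaller case. The sketch below is an assumption-laden toy, not the MACE kernel: it uses the Levi-Civita tensor as a stand-in coupling tensor (in a Cartesian basis, the $l_1=l_2=l_3=1$ coupling is proportional to it, up to normalization and basis convention) and compares a dense contraction over all 27 entries with a loop over only the 6 nonzero entries. All names are hypothetical.

```python
from itertools import product


def levi_civita(i, j, k):
    """Return +1 or -1 for even or odd permutations of (0, 1, 2), else 0."""
    return (i - j) * (j - k) * (k - i) // 2


# Dense 3x3x3 coupling tensor and its list of nonzero entries (i, j, k, value).
DENSE = [[[levi_civita(i, j, k) for k in range(3)] for j in range(3)]
         for i in range(3)]
NONZERO = [(i, j, k, DENSE[i][j][k])
           for i, j, k in product(range(3), repeat=3)
           if DENSE[i][j][k] != 0]


def contract_dense(x, y):
    """out[k] = sum_{i,j} C[i][j][k] * x[i] * y[j], visiting all 27 entries."""
    out = [0.0, 0.0, 0.0]
    for i, j, k in product(range(3), repeat=3):
        out[k] += DENSE[i][j][k] * x[i] * y[j]
    return out


def contract_sparse(x, y):
    """Same contraction, visiting only the stored nonzero entries."""
    out = [0.0, 0.0, 0.0]
    for i, j, k, c in NONZERO:
        out[k] += c * x[i] * y[j]
    return out


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print("nonzero entries:", len(NONZERO), "of", 3 ** 3)
print("dense :", contract_dense(x, y))
print("sparse:", contract_sparse(x, y))
```

Both functions compute the same result (here, the cross product of `x` and `y`), but the sparse version performs fewer multiply-adds. The same principle motivates skipping zero coefficients at higher $l$, where the coupling tensors are larger.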