ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours

Feiwen Zhu; Arkadiusz Nowaczynski; Rundong Li; Jie Xin; Yifei Song; Michal Marcinkiewicz; Sukru Burc Eryilmaz; Jun Yang; Michael Andersch

TL;DR

ScaleFold targets the scalability bottlenecks in AlphaFold training by identifying and removing communication imbalance and compute inefficiency. It combines a non-blocking data pipeline, CUDA Graphs, specialized Triton kernels, fused optimizations, automatic fusion via torch.compile, and asynchronous evaluation to scale training to 2080 NVIDIA H100 GPUs, finishing the MLPerf HPC v3.0 OpenFold benchmark in 7.51 minutes and completing pretraining from scratch in 10 hours. Key contributions include a systematic analysis of AlphaFold’s training bottlenecks, targeted solutions for both communication and computation, and a speedup of over 6× relative to the benchmark baseline. The practical impact is a much faster path to pretraining AlphaFold-like models, enabling rapid iteration and broader accessibility for protein structure prediction research.

Abstract

AlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its implementation does not include the necessary training code. OpenFold is the first trainable public reimplementation of AlphaFold. The AlphaFold training procedure is prohibitively time-consuming and gets diminishing benefits from scaling to more compute resources. In this work, we conducted a comprehensive analysis of the AlphaFold training procedure based on OpenFold and identified inefficient communications and overhead-dominated computations as the key factors that prevented AlphaFold training from scaling effectively. We introduced ScaleFold, a systematic training method that incorporated optimizations specifically for these factors. ScaleFold successfully scaled AlphaFold training to 2080 NVIDIA H100 GPUs with high resource utilization. In the MLPerf HPC v3.0 benchmark, ScaleFold finished the OpenFold benchmark in 7.51 minutes, a speedup of over $6\times$ relative to the baseline. For training the AlphaFold model from scratch, ScaleFold completed the pretraining in 10 hours, a significant improvement over the seven days required by the original AlphaFold pretraining baseline.

Paper Structure (27 sections, 11 figures, 1 table)

This paper contains 27 sections, 11 figures, 1 table.

Introduction
Background
The AlphaFold Model
Challenges of the AlphaFold Training
High Memory Consumption
Massive Memory-Bounded Kernels
Suboptimal Key-Operation Performance
Limited Data-Parallel (DP) Degree
Dynamic Axial Parallelism
Scale The AlphaFold Training
Barriers to AlphaFold's Training Scalability
Reduce Communication Imbalance
Non-Blocking Data Pipeline
CUDA Graph
Improve Computation Efficiency
...and 12 more sections

Figures (11)

Figure 1: Structure of the AlphaFold model. Evoformer is the main building block of the AlphaFold model. The Input Embeddings include the Template Pair Stack, which contains 2 Evoformer blocks. The Extra MSA Stack contains 4 Evoformer blocks, and the Evoformer Stack contains 48 Evoformer blocks.
Figure 2: Structure of the Evoformer block.
Figure 3: Breakdown of factors that prevent the AlphaFold training from achieving better scalability. Numbers indicate the relative difference between the actual time and the theoretically optimal time per training step.
Figure 4: Sorted batch preparation times for AlphaFold's training dataset. Depending on a data sample's initial sequence length and multiple sequence alignment (MSA) size, the batch preparation time varies significantly, which can block the data pipeline.
Figure 5: (i) The default PyTorch data loading pipeline vs. (ii) our proposed pipeline. In PyTorch, the DataLoader enforces the sampler order even if doing so blocks training: a slow batch, denoted "b", blocks training even though another batch, "c", is already available. In our proposed design, batch "c" can be yielded before batch "b", which prevents imbalance and idle ranks.
...and 6 more figures
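
The contrast described in the Figure 5 caption can be illustrated with a small, self-contained sketch. This is not ScaleFold's implementation, which is not shown here; it is a toy model using Python's standard `concurrent.futures` module, and the names `PREP_SECONDS`, `prepare_batch`, `in_order_loader`, and `non_blocking_loader`, as well as the preparation times, are hypothetical choices made for illustration. Batch "b" is made artificially slow so that the two yielding policies produce different orders.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical preparation times in seconds; "b" stands in for a slow sample
# (for example, one with a long sequence or a large MSA).
PREP_SECONDS = {"a": 0.01, "b": 0.30, "c": 0.05}


def prepare_batch(name):
    """Simulate batch preparation by sleeping for the batch's assumed cost."""
    time.sleep(PREP_SECONDS[name])
    return name


def in_order_loader(names, num_workers=3):
    """Yield batches strictly in sampler order, waiting on slow ones."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(prepare_batch, n) for n in names]
        for future in futures:
            # Blocks on "b" even when "c" has already finished.
            yield future.result()


def non_blocking_loader(names, num_workers=3):
    """Yield whichever batch finishes first, ignoring sampler order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(prepare_batch, n) for n in names]
        for future in as_completed(futures):
            yield future.result()


sampler_order = ["a", "b", "c"]
in_order = list(in_order_loader(sampler_order))
ready_first = list(non_blocking_loader(sampler_order))
print("in order:   ", in_order)
print("ready first:", ready_first)
```

In the sketch, the in-order loader keeps the consumer waiting for "b", while the ready-first loader hands over "c" as soon as it is prepared. A production pipeline would also need to consider how reordering interacts with sampling across ranks, which this toy example does not address.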
