Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning

Lang Xu, Quentin Anthony, Jacob Hatef, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

TL;DR

This work presents a Frontier-specific, topology-aware extension of ZeRO++ that employs a three-level hierarchical partitioning across GCDs, GPUs, and nodes to minimize inter-node communication while keeping memory usage in check. By porting ZeRO++ to AMD GPUs and applying quantization-assisted collectives, the approach achieves substantial throughput gains, including up to a $1.71\times$ TFLOPS-per-GPU improvement and a peak scaling efficiency of $0.94$ for a $20$B GPT model, with further gains from the hierarchical design. The paper validates the strategy on 10B–20B parameter models across up to 48 Frontier nodes, reporting improvements of $40.5\%$, up to $139.8\%$, and $70.7\%$ over ZeRO-3 and ZeRO++, respectively, while maintaining convergence with block-based quantization. It also analyzes Frontier’s hardware topology to justify design choices and discusses system-specific considerations, model-size limits, and directions for broader evaluations. Overall, the work demonstrates that software-hardware co-design leveraging Frontier’s bandwidth hierarchy can significantly enhance large-scale LLM training efficiency on AMD-based HPC systems.

Abstract

Scaling up Large Language Model (LLM) training involves fitting a very large number of parameters across a limited number of workers. Methods like ZeRO-3 drastically reduce GPU memory pressure, but they often incur heavy communication to ensure global synchronization and consistency. Established efforts such as ZeRO++ use secondary partitions to avoid inter-node communication, given that intra-node GPU-GPU transfers generally have higher bandwidth and lower latency than inter-node connections. However, as more capable infrastructure like Frontier, equipped with AMD GPUs, has emerged with impressive computing capability, there is a need to investigate the hardware topology and develop targeted strategies to improve training efficiency. In this work, we propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve memory utilization. In particular, we propose a 3-level hierarchical partitioning designed specifically for Frontier, currently the 2nd-ranked supercomputing cluster, which leverages the differing bandwidths across communication layers (GCD-GCD, GPU-GPU, and inter-node) to reduce communication overhead. For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU compared with ZeRO++ on up to 384 GCDs, and a scaling efficiency of 0.94 on up to 384 GCDs.
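
To make the three communication layers named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that ranks are numbered contiguously node by node, that each node holds 4 GPUs, and that each GPU exposes 2 GCDs (8 GCDs per node, consistent with 384 GCDs on 48 nodes). The names `placement` and `link_level` are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple

# Assumed layout (illustrative): 2 GCDs per GPU, 4 GPUs per node,
# which gives 8 GCDs per node and 384 GCDs on 48 nodes.
GCDS_PER_GPU = 2
GPUS_PER_NODE = 4
GCDS_PER_NODE = GCDS_PER_GPU * GPUS_PER_NODE


class Placement(NamedTuple):
    node: int  # node index in the job
    gpu: int   # GPU index within the node
    gcd: int   # GCD index within the GPU


def placement(rank: int) -> Placement:
    """Map a global rank to (node, gpu, gcd), assuming contiguous numbering."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    node, local = divmod(rank, GCDS_PER_NODE)
    gpu, gcd = divmod(local, GCDS_PER_GPU)
    return Placement(node, gpu, gcd)


def link_level(rank_a: int, rank_b: int) -> str:
    """Name the communication layer that a transfer between two ranks crosses."""
    a, b = placement(rank_a), placement(rank_b)
    if a.node != b.node:
        return "inter-node"
    if a.gpu != b.gpu:
        return "GPU-GPU"
    if a.gcd != b.gcd:
        return "GCD-GCD"
    return "same GCD"


if __name__ == "__main__":
    for pair in [(0, 1), (0, 2), (0, 8)]:
        print(pair, link_level(*pair))
```

Under these assumptions, ranks 0 and 1 share a GPU, ranks 0 and 2 share only a node, and ranks 0 and 8 sit on different nodes. A hierarchical partitioning scheme would aim to keep the most frequent collectives within the first two cases.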

Paper Structure

This paper contains 23 sections, 1 equation, 10 figures, 10 tables.

Figures (10)

  • Figure 1: ZeRO-3 across two Frontier nodes.
  • Figure 2: Topology of a DGX A100 compute node
  • Figure 3: Topology of a compute node on ORNL Frontier
  • Figure 4: Weight partition communication in Forward & Backward Pass. This diagram assumes primary and secondary partitions across two GCDs.
  • Figure 5: Gradient partition communication in each step.
  • ...and 5 more figures