Pier: Efficient Large Language Model pretraining with Relaxed Global Communication

Shuyuan Fan, Zhao Zhang

TL;DR

Global communication is a dominant cost in large language model (LLM) pretraining. The paper presents Pier, a scalable optimizer built on the DiLoCo framework that relaxes global synchronization and adds momentum warmup and momentum decay to the outer optimizer to preserve convergence. Pier supports 2D parallelism (data plus tensor parallelism) and memory-aware outer updates, achieving substantial end-to-end speedups while maintaining validation loss and downstream-task performance for GPT-2 variants trained on OpenWebText. Experiments across GPT-2 small, medium, XL, and 7B show speedups of up to 3.7x on 256 A100s and 1.2x-1.9x on 64 GH200 Superchips, and cover strong scaling and varying group counts. The work demonstrates a practical pathway to accelerate production-scale LLM pretraining without sacrificing model quality.

Abstract

Global communication, such as all-reduce and all-gather, is the prominent performance bottleneck in large language model (LLM) pretraining. To address this issue, we present Pier, an efficient and scalable optimizer with relaxed global communication. Pier is built upon DiLoCo, which leverages an inner optimizer within groups of processors and an outer optimizer that requires global communication. To preserve convergence and model performance, Pier incorporates two key techniques for the outer optimizer: momentum warmup and momentum decay. Pier employs an efficient and scalable system architecture to enable complex parallelization strategies in LLM pretraining. We examine the model performance and runtime reduction of Pier using the GPT model family (small, medium, XL, and 7B) and the OpenWebText dataset with a suite of thirteen downstream tasks. With a data-parallel strategy, Pier speeds up GPT-2 XL training by 2.7x-3.7x on 256 NVIDIA A100 GPUs and by 1.2x-1.9x on 64 GH200 Superchips, without degradation of validation loss or downstream task performance. With combined data and tensor parallelism, Pier reduces the training time of the GPT-2 7B model by 54.5% on 128 A100s.
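The inner/outer structure described in the abstract can be illustrated with a small toy sketch. It is not Pier's implementation: each "group" runs plain SGD on a per-coordinate quadratic as a stand-in for the inner optimizer, the outer step uses heavy-ball momentum for simplicity, and `warmup_momentum` is a hypothetical linear ramp, since the abstract does not specify the warmup schedule. Momentum decay is omitted because its form is not given here. All function names and hyperparameter values are assumptions chosen for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def warmup_momentum(outer_step: int, warmup_steps: int, mu_max: float) -> float:
    """Hypothetical schedule: ramp the outer momentum linearly from 0 to mu_max."""
    if warmup_steps <= 0:
        return mu_max
    return mu_max * min(1.0, outer_step / warmup_steps)


def inner_steps(params: Vector, target: Vector, lr: float, num_steps: int) -> Vector:
    """Local training inside one group: SGD on 0.5 * (p - t)^2 per coordinate.

    No global communication happens here.
    """
    p = list(params)
    for _ in range(num_steps):
        p = [pi - lr * (pi - ti) for pi, ti in zip(p, target)]
    return p


def outer_update(global_params: Vector, group_params: List[Vector],
                 velocity: Vector, outer_lr: float, mu: float) -> Tuple[Vector, Vector]:
    """One globally synchronized step.

    Averages the group results (the only global reduction), treats
    global - average as a pseudo-gradient, and applies heavy-ball momentum.
    """
    n = len(group_params)
    avg = [sum(g[i] for g in group_params) / n for i in range(len(global_params))]
    pseudo_grad = [gp - a for gp, a in zip(global_params, avg)]
    new_velocity = [mu * v + d for v, d in zip(velocity, pseudo_grad)]
    new_params = [gp - outer_lr * v for gp, v in zip(global_params, new_velocity)]
    return new_params, new_velocity


def train(targets: List[Vector], rounds: int = 300, inner: int = 10,
          inner_lr: float = 0.1, outer_lr: float = 0.7,
          mu_max: float = 0.9, warmup: int = 10) -> Vector:
    """Run `rounds` outer steps; each group owns one target (its data shard)."""
    dim = len(targets[0])
    params = [0.0] * dim
    velocity = [0.0] * dim
    for r in range(rounds):
        # Every group starts from the shared parameters and trains locally.
        group_params = [inner_steps(params, t, inner_lr, inner) for t in targets]
        mu = warmup_momentum(r, warmup, mu_max)
        params, velocity = outer_update(params, group_params, velocity, outer_lr, mu)
    return params


if __name__ == "__main__":
    # Two groups with different local optima; the shared model should
    # settle near their mean.
    print(train([[1.0, 2.0], [3.0, 0.0]]))
```

In this toy setup, global communication occurs once per `inner` local steps rather than at every step, which is the source of the communication savings the abstract refers to.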

Paper Structure

This paper contains 23 sections, 3 equations, 8 figures, 4 tables, 2 algorithms.

Figures (8)

  • Figure 1: Validation Loss Comparison between AdamW (8 GPUs, Fully Synchronized) and DiLoCo (8 Groups, 1 GPU per Group) During the Pretraining of GPT-2 XL Model.
  • Figure 2: Illustration of Inner- and Outer-Communication in a 2D Parallel Training Setup with Data Parallel Size of 4 and a Tensor Parallel Size of 2. There are two local communication groups, each located on a separate compute node.
  • Figure 3: Validation Loss Curves of GPT-2 Small, Medium, and XL during Pretraining with AdamW, DiLoCo, and Pier. For GPT-2 Small and XL, our approach achieves validation losses that are closer to those of AdamW. For GPT-2 Medium, our approach achieves lower validation loss than the original DiLoCo.
  • Figure 4: Validation Loss Curves of Pier under Weak Scaling. The runs with 4 GPUs exhibit convergence. When scaling up to 16 or 32 GPUs, the validation loss rises significantly compared with the default 8-GPU setting.
  • Figure 5: Runtime and Strong Scaling Efficiency Comparison Between AdamW and Pier. The proposed method achieves up to 1.7x, 2.6x, and 2.7x speedup for pretraining of GPT-2 Small, Medium, and XL on the NERSC Perlmutter Cluster. The number of communication groups is set to {8, 32, 64} in the Small, Medium, and XL panels, respectively; convergence for these settings is verified in the convergence experiments section.
  • ...and 3 more figures