Enhancing Stability for Large Language Models Training in Constrained Bandwidth Networks

Yun Dai; Tejas Dharamsi; Byron Hsu; Tao Song; Hamed Firooz

TL;DR

The paper addresses instability in ZeRO++ hpZ when training very large language models on bandwidth-constrained clusters. It identifies a race condition between asynchronous parameter partitioning and AllGather, and proposes an explicit CUDA synchronization between D2D Memcpy and AllGather to fix it, preserving training efficiency. Empirical results show restored convergence on Falcon-40B and Llama-2-scale models with throughput gains up to $98\%$ over baselines, without sacrificing convergence quality on the MMLU task. This approach enables robust, scalable training of giant transformers on commodity hardware, broadening accessibility and reducing costs for large-scale model development.

Abstract

Training extremely large language models (LLMs) with billions of parameters is a computationally intensive task that pushes the limits of current data-parallel training systems. While techniques like ZeRO++ have enabled efficient distributed training of such giant models on inexpensive low-bandwidth clusters, they can suffer from convergence issues due to potential race conditions in the hierarchical partitioning (hpZ) scheme used to reduce cross-machine communication. In this work, we first show how these race conditions cause instability when training models with billions of parameters. We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency. Empirical evaluation on training the multi-billion-parameter Falcon and Llama-2 models demonstrates the updated algorithm's ability to achieve reliable convergence on these massive models, where stock ZeRO++ hpZ fails to converge. The updated algorithm enables robust training of larger models with a 98\% improvement in throughput and model training speed, without sacrificing the quality of convergence.
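The race described above can be illustrated with a toy model. The sketch below is not the authors' implementation and uses no CUDA or DeepSpeed APIs: it assumes a background thread stands in for the asynchronous device-to-device copy that fills the secondary partition, and a plain list read stands in for the AllGather. All names (`run_step`, `secondary`, `release`, `done`) are hypothetical and chosen for illustration only.

```python
import threading


def run_step(synchronize: bool) -> list:
    """Toy model of one hpZ step: a slow async copy followed by a gather."""
    secondary = [0.0] * 4            # secondary partition, still holding stale values
    fresh = [1.0, 2.0, 3.0, 4.0]     # updated weights that must be copied in
    release = threading.Event()      # models copy latency: copy runs only once set
    done = threading.Event()         # plays the role of a "copy finished" event

    def d2d_copy() -> None:
        release.wait()               # the copy is "in flight" until released
        secondary[:] = fresh         # write the fresh weights into the buffer
        done.set()

    worker = threading.Thread(target=d2d_copy)
    worker.start()

    if synchronize:
        release.set()                # let the copy run ...
        done.wait()                  # ... and block the gather until it completes

    gathered = list(secondary)       # stand-in for AllGather reading the buffer

    release.set()                    # in the racy case the copy lands too late
    worker.join()
    return gathered


if __name__ == "__main__":
    print("without sync:", run_step(synchronize=False))
    print("with sync:   ", run_step(synchronize=True))
```

Without the wait, the gather reads the stale buffer because the copy has not landed yet. With the wait, it always sees the fresh values. This ordering guarantee is what the explicit synchronization between the copy and the AllGather is meant to provide.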

Paper Structure (7 sections, 1 equation, 2 figures, 2 tables, 1 algorithm)

Introduction
Background
Algorithm
Experimentation
Experimentation setup
Divergence Analysis
Conclusion

Figures (2)

Figure 1: An end-to-end training step on a model with $N$ layers with ZeRO++ $qgZ$ and $hpZ$. The forward and backward passes on layer $k$ are expanded. With ZeRO3 prefetch, the subsequent AllGather kernel for the backward pass can be enqueued immediately, while the post-forward repartition is still in progress.
Figure 2: Validation loss convergence per optimization step without hpZ and with the modified hpZ from Algorithm 1, tested on Llama-2-7b on the MMLU dataset. The brown curve shows the convergence issue when training without the fix.
