Efficient Parallelization Layouts for Large-Scale Distributed Model Training

Johannes Hagemann; Samuel Weinbach; Konstantin Dobler; Maximilian Schall; Gerard de Melo

TL;DR

This work conducts a comprehensive ablation study of possible training configurations for large language models and finds that using a micro-batch size of 1 usually enables the most efficient training layouts.

Abstract

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies interact in complex ways that affect the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs Utilization (MFU) of 70.5% when training a Llama 13B model.
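The link between micro-batch size and pipeline bubbles can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a non-interleaved pipeline schedule (such as GPipe or 1F1B), for which the commonly used idle-time estimate is (p - 1) / (m + p - 1) with p pipeline stages and m micro-batches per step. The function names and the batch sizes in the example are hypothetical and are not taken from the paper.

```python
def micro_batches_per_step(global_batch_size: int,
                           micro_batch_size: int,
                           data_parallel_size: int) -> int:
    """Number of micro-batches each pipeline processes per optimizer step.

    Assumes the global batch is split evenly across data-parallel replicas
    and then into micro-batches of equal size.
    """
    samples_per_micro_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError("global batch must be divisible by "
                         "micro_batch_size * data_parallel_size")
    return global_batch_size // samples_per_micro_step


def pipeline_bubble_fraction(pipeline_stages: int, num_micro_batches: int) -> float:
    """Estimated idle fraction for a non-interleaved schedule: (p - 1) / (m + p - 1)."""
    if pipeline_stages < 1 or num_micro_batches < 1:
        raise ValueError("pipeline_stages and num_micro_batches must be >= 1")
    return (pipeline_stages - 1) / (num_micro_batches + pipeline_stages - 1)


if __name__ == "__main__":
    # Hypothetical setup: 2048 sequences per step, 8 data-parallel replicas,
    # 4 pipeline stages. Only the micro-batch size varies.
    for mbs in (1, 2, 4, 8):
        m = micro_batches_per_step(2048, mbs, 8)
        bubble = pipeline_bubble_fraction(4, m)
        print(f"micro-batch size {mbs}: {m} micro-batches, bubble ~ {bubble:.2%}")
```

With the global batch size held fixed, a larger micro-batch size leaves fewer micro-batches to fill the pipeline, so the estimated bubble grows. This covers only the bubble term; the memory effects mentioned in the abstract (activation checkpointing, higher model parallelism) are not modeled here.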

Paper Structure (41 sections, 5 figures, 3 tables)

This paper contains 41 sections, 5 figures, 3 tables.

Introduction
Background
Data Parallelism.
Tensor Parallelism.
Pipeline Parallelism.
3D Parallelism.
Sequence Parallelism.
Activation Checkpointing.
Fused Kernels.
Flash Attention.
Experimental Setup
Efficient LLM Training Analysis
Fused Kernels and Flash Attention
Activation Checkpointing
Micro-batch size
...and 26 more sections

Figures (5)

Figure 1: Comparison of the MFU with different attention layer optimizations. The optimal 3D layout was selected for each respective setting. Each optimal layout is annotated with its (micro-batch size, tensor parallelism size, pipeline parallelism size). The kernel from Megatron-LM failed to operate with an 8k sequence length.
Figure 2: Comparing MFU of the optimal 3D layout with and without activation checkpointing. Llama 30B with 8k sequence length did not fit into memory without checkpointing. The reported results do not use the RMSNorm kernel. Each optimal layout is annotated with its (micro-batch size, tensor parallelism size, pipeline parallelism size).
Figure 3: MFU of the best-performing run configurations at different fixed micro-batch sizes, visualized by the (activation checkpointing, tensor parallelism size, pipeline parallelism size) triple. The reported results do not use the RMSNorm kernel.
Figure 4: MFU for various model and pipeline parallel configurations for the Llama 13B with 8k sequence length, Llama 30B, and Llama 65B models. Only runs with a micro-batch size of 1, activation checkpointing disabled, FlashAttention-2, and the RMSNorm kernel are included; runs that ran out of memory are excluded. The Llama 13B and the Llama 30B with 8k sequence length models are excluded due to limited model parallel configuration options in our sweep.
Figure 5: Comparing MFU of the optimal 3D layout with and without sequence parallelism. The reported results use the RMSNorm kernel. Each optimal layout is annotated with its (micro-batch size, tensor parallelism size, pipeline parallelism size).
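All five figures report MFU, so a short sketch of how such a number can be computed may help when reading them. This is a minimal sketch under stated assumptions, not the paper's measurement code: it assumes the widely used PaLM-style estimate of model FLOPs per token, 6N + 12·L·H·Q·T (N parameters, L layers, H attention heads, Q head dimension, T sequence length), and divides achieved model FLOPs per second by the aggregate peak of the hardware. The function names, the throughput value, the device count, and the peak-FLOPs figure in the example are hypothetical placeholders.

```python
def model_flops_per_token(n_params: float, n_layers: int, n_heads: int,
                          head_dim: int, seq_len: int) -> float:
    """PaLM-style estimate of training FLOPs per token: 6N + 12*L*H*Q*T.

    The 6N term covers forward and backward matrix multiplies with the
    weights; the second term accounts for self-attention over the sequence.
    """
    return 6 * n_params + 12 * n_layers * n_heads * head_dim * seq_len


def mfu(tokens_per_second: float, flops_per_token: float,
        n_devices: int, peak_flops_per_device: float) -> float:
    """Model FLOPs utilization: achieved model FLOPs/s over aggregate peak FLOPs/s."""
    return tokens_per_second * flops_per_token / (n_devices * peak_flops_per_device)


if __name__ == "__main__":
    # Shape roughly matching a 13B-parameter Llama-style model (assumed values).
    per_token = model_flops_per_token(
        n_params=13e9, n_layers=40, n_heads=40, head_dim=128, seq_len=2048)
    # Hypothetical throughput and hardware: replace with measured values.
    utilization = mfu(tokens_per_second=100_000.0, flops_per_token=per_token,
                      n_devices=64, peak_flops_per_device=312e12)
    print(f"FLOPs per token: {per_token:.3e}, MFU: {utilization:.1%}")
```

Because MFU counts only the FLOPs the model itself requires, recomputation from activation checkpointing does not raise it; extra work of that kind shows up as lower throughput and therefore lower MFU.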
