Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization

Zhanda Zhu, Christina Giannoula, Muralidhar Andoorveedu, Qidong Su, Karttikeya Mangalam, Bojian Zheng, Gennady Pekhimenko

TL;DR

Mist tackles the efficiency challenges of distributing large language model training by jointly optimizing memory footprint reductions and parallelism with a focus on overlap and microbatch imbalance. It combines fine-grained overlap-centric scheduling, a symbolic performance prediction framework, and imbalance-aware hierarchical tuning to explore a vast configuration space efficiently. Through extensive experiments across GPT-3, Llama, and Falcon models on diverse GPUs, Mist achieves up to 1.73× speedups over manual baselines and up to 2.04× over automatic baselines, while maintaining accuracy. The approach promises practical impact by enabling faster, more memory-efficient LLM training with reduced tuning overhead.

Abstract

Various parallelism techniques, such as data, tensor, and pipeline parallelism, along with memory optimizations like activation checkpointing, redundancy elimination, and offloading, have been proposed to accelerate distributed training for Large Language Models. To find the best combination of these techniques, automatic distributed training systems have been proposed. However, existing systems tune only a subset of optimizations because they lack overlap awareness, cannot navigate the vast search space, and ignore inter-microbatch imbalance, which leads to sub-optimal performance. To address these shortcomings, we propose Mist, a memory-, overlap-, and imbalance-aware automatic distributed training system that comprehensively co-optimizes all memory footprint reduction techniques alongside parallelism. Mist is based on three key ideas: (1) fine-grained overlap-centric scheduling, which orchestrates optimizations in an overlapped manner; (2) symbolic-based performance analysis, which predicts runtime and memory usage using symbolic expressions for fast tuning; and (3) imbalance-aware hierarchical tuning, which decouples the process into an inter-stage imbalance- and overlap-aware Mixed Integer Linear Programming problem and an intra-stage Dual-Objective Constrained Optimization problem, and connects them through Pareto frontier sampling. Our evaluation results show that Mist achieves an average speedup of 1.28$\times$ (up to 1.73$\times$) over the state-of-the-art manual system Megatron-LM and 1.27$\times$ (up to 2.04$\times$) over the state-of-the-art automatic system Aceso.
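
The abstract names Pareto frontier sampling as the link between the intra-stage and inter-stage problems but does not spell out the mechanics. As a rough illustration only, and not Mist's actual implementation, the sketch below filters a set of candidate intra-stage configurations down to those that are not dominated in both runtime and peak memory. The function name `pareto_frontier`, the `(runtime, memory)` tuple layout, and all numeric values are assumptions made up for this example.

```python
from typing import List, Tuple

# Hypothetical candidate: (runtime_seconds, peak_memory_gb). Lower is better for both.
Config = Tuple[float, float]


def pareto_frontier(candidates: List[Config]) -> List[Config]:
    """Keep only candidates that no other candidate beats on both objectives."""
    frontier: List[Config] = []
    best_memory = float("inf")
    # Sorting by runtime (ties broken by memory) lets a single sweep decide:
    # a candidate survives only if it uses strictly less memory than every
    # candidate that is at least as fast.
    for runtime, memory in sorted(candidates):
        if memory < best_memory:
            frontier.append((runtime, memory))
            best_memory = memory
    return frontier


# Made-up measurements for six intra-stage configurations.
candidates = [
    (1.0, 20.0),
    (1.2, 16.0),
    (1.1, 22.0),  # slower and larger than (1.0, 20.0): dominated
    (1.5, 12.0),
    (1.6, 12.5),  # slower and larger than (1.5, 12.0): dominated
    (2.0, 8.0),
]

frontier = pareto_frontier(candidates)
print(frontier)
```

Under these assumptions, an inter-stage solver would only need to choose among the surviving points for each stage, trading runtime against memory, rather than among every raw configuration.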

Paper Structure

This paper contains 38 sections, 5 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: An illustration of optimization configurations.
  • Figure 2: Motivational examples of tuning parallelism with memory optimizations for GPT-3-2.7B on 4 NVIDIA L4 GPUs with $Seq=4096, B_{global}=8$. Parallelism is always tuned.
  • Figure 3: Motivational example showing the source of the speedup from comprehensive co-optimization for GPT-3-7B on 8 NVIDIA L4 GPUs with $Seq = 2048$, $B_{global}=512$.
  • Figure 4: Illustration of the pipeline parallelism overlap opportunity and inter-microbatch imbalance. $\text{a}'$ is the extra communication that occurs in the first microbatch.
  • Figure 5: Growth in the number of configurations within the search space as each optimization is incrementally added.
  • ...and 11 more figures