Balancing Pipeline Parallelism with Vocabulary Parallelism

Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan

TL;DR

This work identifies a critical imbalance in pipeline-parallel training of large language models: the vocabulary layers impose disproportionate compute and memory demands, creating pipeline bubbles and peak-memory bottlenecks. It introduces Vocabulary Parallelism, which partitions the vocabulary layers across pipeline devices and groups their computation into passes that can be integrated into existing schedules with minimal activation-memory overhead. Two algorithms reduce communication barriers in the vocabulary passes, and a general scheduling approach incorporates them into common schedules, yielding up to 51% higher throughput and significantly lower peak memory for large vocabularies. The approach is validated on Megatron-LM style pipelines across multiple model and vocabulary sizes and shows strong gains, especially when combined with memory-balanced schedules like V-Half. An open-source implementation is available at https://github.com/sail-sg/VocabularyParallelism.

Abstract

Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches while significantly reducing peak memory usage, especially in large-vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism.
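
To make the idea of partitioning the output layer across the vocabulary dimension concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes each vocabulary shard stands in for one pipeline device, and it replaces collective communication with plain Python reductions. The function names (sharded_cross_entropy, reference_cross_entropy) are hypothetical, and the three reductions it uses (global max, global sum of exponentials, target logit) are simply those a sharded softmax cross-entropy forward pass needs; they are not claimed to match the paper's exact communication pattern.

```python
import math


def sharded_cross_entropy(logits, target, num_shards):
    """Cross-entropy for one token, computed from vocabulary shards.

    Each shard stands in for one device holding a contiguous slice of
    the vocabulary. The three reductions below are ordinary Python
    reductions standing in for cross-device communication (assumption).
    """
    vocab = len(logits)
    assert vocab % num_shards == 0, "sketch assumes an even split"
    width = vocab // num_shards
    shards = [logits[i * width:(i + 1) * width] for i in range(num_shards)]

    # Reduction 1: global maximum logit, for numerical stability.
    global_max = max(max(s) for s in shards)

    # Reduction 2: global sum of shifted exponentials (softmax denominator).
    global_sum = sum(sum(math.exp(x - global_max) for x in s) for s in shards)

    # Reduction 3: the target logit lives on exactly one shard;
    # every other shard contributes zero.
    target_logit = sum(
        s[target - i * width] if i * width <= target < (i + 1) * width else 0.0
        for i, s in enumerate(shards)
    )

    return math.log(global_sum) + global_max - target_logit


def reference_cross_entropy(logits, target):
    """Unsharded cross-entropy, used only to check the sharded version."""
    m = max(logits)
    return math.log(sum(math.exp(x - m) for x in logits)) + m - logits[target]
```

Each reduction in this sketch depends on the result of the previous one or on data held by other shards, which is the kind of synchronization point the abstract calls a communication barrier.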

Paper Structure

This paper contains 42 sections, 5 equations, 17 figures, 7 tables, and 2 algorithms.

Figures (17)

  • Figure 1: Repeating pattern in an imbalanced pipeline. Bubbles are incurred due to an extra output layer in the last pipeline stage.
  • Figure 2: Ratio of compute and memory of vocabulary layers compared to transformer layers in Gemma2-9B.
  • Figure 3: Transformer Layer Redistribution for a 7B GPT-like model with vocabulary size 128k. In this case, each stage has 2 transformer layers, while the output layer is equivalent to 2.4x a transformer layer in compute and 2.6x in parameter memory.
  • Figure 4: Computation graph of the output layer after partitioning across the vocabulary dimension. There are three all-reduce / reduce communications across all devices.
  • Figure 5: Overlapping all-reduce communication with transformer layer computation.
  • ...and 12 more figures