Balancing Pipeline Parallelism with Vocabulary Parallelism
Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan
TL;DR
This work identifies a critical imbalance in pipeline-parallel training of large language models: the vocabulary-related layers introduce disproportionate compute and memory demands that create pipeline bubbles and peak memory bottlenecks. It introduces Vocabulary Parallelism, which partitions vocabulary layers across pipeline devices and reframes their computation as passes that can be integrated into existing schedules with minimal activation-memory overhead. Two algorithms reduce communication barriers in the vocabulary passes, and a general scheduling approach allows seamless incorporation into common schedules, yielding throughput improvements of up to 51% and significantly reduced peak memory for large vocabularies. The approach is validated on Megatron-LM style pipelines across multiple model and vocabulary sizes and shows strong gains, especially when combined with memory-balanced schedules like V-Half. An open-source implementation is available at https://github.com/sail-sg/VocabularyParallelism.
Abstract
Pipeline parallelism is widely used to scale the training of transformer-based large language models, and many works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms that reduce communication barriers within the vocabulary layers. Additionally, we use a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. Combined, these techniques effectively balance computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches while significantly reducing peak memory usage, especially in large-vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism.
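To make the idea of partitioning the vocabulary layers concrete, the following is a minimal sketch of a softmax cross-entropy computed over vocabulary shards. It is a generic illustration, not the algorithms proposed in the paper: devices are simulated by a Python list, the function names (`shard_columns`, `local_logits`, `sharded_cross_entropy`, `full_cross_entropy`) are hypothetical, and the vocabulary size is assumed to be divisible by the number of shards. The two reductions marked in the comments are the kind of cross-device synchronization points that a sharded output layer has to deal with.

```python
import math

def shard_columns(weight, num_shards):
    """Split an (h x V) weight matrix (list of rows) into num_shards column blocks."""
    vocab = len(weight[0])
    assert vocab % num_shards == 0, "sketch assumes V is divisible by the shard count"
    size = vocab // num_shards
    return [[row[r * size:(r + 1) * size] for row in weight] for r in range(num_shards)]

def local_logits(hidden, shard):
    """Logits of one token (length-h vector) against one block of vocabulary columns."""
    cols = len(shard[0])
    return [sum(hidden[k] * shard[k][j] for k in range(len(hidden))) for j in range(cols)]

def sharded_cross_entropy(hidden, shards, label):
    """Cross-entropy loss for one token, computed shard by shard."""
    # Each simulated device only touches its own slice of the vocabulary.
    per_shard = [local_logits(hidden, s) for s in shards]
    # Reduction 1: global maximum for numerical stability (an all-reduce max on real devices).
    global_max = max(max(l) for l in per_shard)
    # Reduction 2: softmax denominator (an all-reduce sum on real devices).
    denom = sum(sum(math.exp(x - global_max) for x in l) for l in per_shard)
    # The logit of the target token lives on exactly one shard.
    size = len(shards[0][0])
    label_logit = per_shard[label // size][label % size]
    return global_max + math.log(denom) - label_logit

def full_cross_entropy(hidden, weight, label):
    """Unsharded reference: the same loss computed over the whole vocabulary at once."""
    logits = local_logits(hidden, weight)
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[label]

# Toy example (illustrative values only): hidden size 2, vocabulary size 4, 2 shards.
hidden = [0.5, -1.0]
weight = [[0.1, 0.2, 0.3, 0.4],
          [0.0, -0.5, 0.25, 1.0]]
shards = shard_columns(weight, 2)
loss_sharded = sharded_cross_entropy(hidden, shards, label=3)
loss_full = full_cross_entropy(hidden, weight, label=3)
```

In this toy setup `loss_sharded` and `loss_full` agree up to floating-point rounding, which shows that splitting the vocabulary changes where the work happens but not the result. Each shard holds only `V / p` columns of the weight matrix, so both the parameter memory and the logit computation are divided evenly among the `p` participants.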
