Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide
Hossam Amer, Rezaul Karim, Ali Pourranjbar, Weiwei Zhang, Walid Ahmed, Boxing Chen
TL;DR
This work surveys distributed parallelism strategies for large language models, emphasizing data, model, activation, and memory optimization techniques and their interactions. It provides theoretical analyses of FLOPs, memory, and communication across GQA, MLP, and Mamba blocks, and demonstrates how hybrid 3D/4D parallelism can be tuned for training and inference workloads. The authors validate insights through case studies on Transformer- and Mamba-based models (LLaMA variants), highlighting when data-parallel, tensor-parallel, pipeline-parallel, or context-parallel configurations maximize efficiency and model FLOPs utilization (MFU) under memory and bandwidth constraints. They propose system design guidelines and discuss auto-parallelization as a promising direction, while outlining key challenges in resource utilization, energy, and cross-layer coherence. Overall, the paper offers a principled framework for selecting parallel strategies, supported by both theory and empirical results, to guide scalable, efficient deployment of next-generation LLMs.
Abstract
With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive overviews of these techniques, systematic analysis of their benefits and trade-offs, and of how such insights can inform a principled methodology for designing optimal distributed systems, remains limited. This paper offers a comprehensive review of collective operations and distributed parallel strategies, complemented by mathematical formulations to deepen theoretical understanding. We further examine hybrid parallelization designs, emphasizing communication-computation overlap across different stages of model deployment, including both training and inference. Recent advances in automated search for optimal hybrid parallelization strategies using cost models are also discussed. Moreover, we present case studies with mainstream architecture categories to reveal empirical insights that can guide researchers and practitioners in parallelism strategy selection. Finally, we highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large-scale model development.
