Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide

Hossam Amer; Rezaul Karim; Ali Pourranjbar; Weiwei Zhang; Walid Ahmed; Boxing Chen

TL;DR

This work surveys distributed parallelism strategies for large language models, covering data parallelism, model parallelism, activation parallelization, and memory optimization techniques, along with their interactions. It provides theoretical analyses of FLOPs, memory, and communication across GQA, MLP, and Mamba blocks, and demonstrates how hybrid 3D/4D parallelism can be tuned for training and inference workloads. The authors validate these insights through case studies on Transformer- and Mamba-based models (LLaMA variants), highlighting when data-parallel, tensor-parallel, pipeline-parallel, or context-parallel configurations maximize efficiency and model FLOPs utilization (MFU) under memory and bandwidth constraints. They propose system design guidelines and discuss auto-parallelization as a promising direction, while outlining key challenges in resource utilization, energy, and cross-layer coherence. Overall, the paper offers a principled framework for selecting parallel strategies, supported by both theory and empirical results, to guide scalable, efficient deployment of next-generation LLMs.

Abstract

With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive overviews of these techniques, systematic analysis of their benefits and trade-offs remains limited, as does guidance on how such insights can inform a principled methodology for designing optimal distributed systems. This paper offers a comprehensive review of collective operations and distributed parallel strategies, complemented by mathematical formulations to deepen theoretical understanding. We further examine hybrid parallelization designs, emphasizing communication-computation overlap across different stages of model deployment, including both training and inference. Recent advances in automated search for optimal hybrid parallelization strategies using cost models are also discussed. Moreover, we present case studies with mainstream architecture categories to reveal empirical insights that guide researchers and practitioners in parallelism strategy selection. Finally, we highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large-scale model development.

Paper Structure (111 sections, 87 equations, 11 figures, 15 tables)

Introduction
Distributed Strategies Background
Collective Operations
Data Parallelism
Model Parallelism
Activation Parallelization
Memory Optimization Techniques
Activation Checkpointing
Gradient Release
Redundancy Elimination (ZeRO)
Memory Offloading
System-Level Integration
Performance Implications
Communication Overlap Techniques
Data Decomposition
...and 96 more sections

Figures (11)

Figure 1: Overview of the key aspects of scalable efficient distributed systems for AI workloads that we cover in this study.
Figure 2: All-Gather operation: before (left) each rank holds only its reduced slice; after (right) each rank holds all slices.
Figure 3: Reduce-Scatter operation: before (left) each rank holds all slices; after (right) each rank holds the reduced slice.
Figure 4: Illustration of 3D parallelism showing DP, PP, and TP groupings for 8 GPUs with (DP,PP,TP) as (2,2,2). The rank assignments for the parallelization groups here are DP: [[0,1,2,3], [4,5,6,7]]; PP: [[0,2], [1,3], [4,6], [5,7]]; TP: [[0,1], [2,3], [4,5], [6,7]].
Figure 5: Illustration of 4D parallelism showing DP, PP, TP, and CP groupings for a cluster of 16 GPUs with (DP,PP,TP,CP) as (2,2,2,2). The rank assignments for the parallelization groups here are DP: [[0,1,2,3,4,5,6,7], [8,9,10,11,12,13,14,15]]; PP: [[0,1,2,3], [4,5,6,7], [8,9,10,11], [12,13,14,15]]; CP: [[0,2], [1,3],...]; TP: [[0,1], [2,3],...].
...and 6 more figures
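
The rank lists in the captions of Figures 4 and 5 can be reproduced with a small sketch. It assumes ranks are laid out row-major over the parallelism axes, with TP varying fastest, which is consistent with the listed groups but is not stated on this page. The helper names (`rank_of`, `groups_along`, `blocks_of`) are hypothetical and belong to no library. Under this reading, the TP, CP, and Figure 4 PP lists are groups of ranks that differ along one axis only, while the DP lists (and the PP list in Figure 5) appear to be contiguous blocks of ranks sharing the same leading coordinates.

```python
from itertools import product


def rank_of(coords, sizes):
    """Row-major rank of a coordinate tuple: the last axis varies fastest."""
    rank = 0
    for c, s in zip(coords, sizes):
        rank = rank * s + c
    return rank


def groups_along(axis, sizes):
    """Groups of ranks that differ only in their coordinate on `axis`."""
    # Fix every other axis to each possible value; sweep only `axis`.
    others = [range(s) if i != axis else [0] for i, s in enumerate(sizes)]
    groups = []
    for base in product(*others):
        group = []
        for k in range(sizes[axis]):
            coords = list(base)
            coords[axis] = k
            group.append(rank_of(coords, sizes))
        groups.append(group)
    return groups


def blocks_of(axis, sizes):
    """Contiguous blocks of ranks sharing coordinates on axes 0..axis."""
    block = 1
    for s in sizes[axis + 1:]:
        block *= s
    world = block
    for s in sizes[:axis + 1]:
        world *= s
    return [list(range(start, start + block)) for start in range(0, world, block)]


# Figure 4: axes ordered (DP, PP, TP) with sizes (2, 2, 2).
sizes_3d = (2, 2, 2)
print("DP blocks:", blocks_of(0, sizes_3d))     # [[0, 1, 2, 3], [4, 5, 6, 7]]
print("PP groups:", groups_along(1, sizes_3d))  # [[0, 2], [1, 3], [4, 6], [5, 7]]
print("TP groups:", groups_along(2, sizes_3d))  # [[0, 1], [2, 3], [4, 5], [6, 7]]

# Figure 5: assumed axis order (DP, PP, CP, TP) with sizes (2, 2, 2, 2).
sizes_4d = (2, 2, 2, 2)
print("PP blocks:", blocks_of(1, sizes_4d))
print("CP groups:", groups_along(2, sizes_4d)[:2])  # [[0, 2], [1, 3]]
```

Keeping TP on the fastest-varying axis places each tensor-parallel group on adjacent ranks, which on many clusters maps to GPUs within the same node; whether the paper relies on that mapping is not shown in this excerpt.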
