DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism

Yifan Niu; Han Xiao; Dongyi Liu; Wei Zhou; Jia Li

TL;DR

This work proposes Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. It also develops a polynomial-time algorithm that generates near-optimal parallelism strategies with only millisecond-level overhead per training batch.

Abstract

Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. We generalize parallelism degrees beyond powers of two and develop a polynomial-time algorithm that generates near-optimal parallelism strategies with only millisecond-level overhead per training batch. DHP maintains high hardware efficiency even under extreme data variability. Experimental results demonstrate that DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to $1.36\times$ speedup in training throughput while maintaining near-linear scaling efficiency across large-scale NPU clusters.
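To make the benefit of non-power-of-two parallelism degrees concrete, the following is a toy sketch and not the paper's solver or cost model. It assumes a purely hypothetical capacity model in which each device can hold a fixed number of tokens, and all names (`min_degree`, `devices_needed`, `tokens_per_device`) are illustrative. Under that assumption, it counts how many devices a heterogeneous batch occupies when each sequence's degree may be any integer versus when it must be rounded up to a power of two.

```python
def min_degree(seq_len: int, tokens_per_device: int) -> int:
    """Smallest device count whose combined (assumed) capacity holds the sequence."""
    return max(1, -(-seq_len // tokens_per_device))  # integer ceiling division


def next_power_of_two(n: int) -> int:
    """Round a positive integer up to the nearest power of two."""
    return 1 << (n - 1).bit_length()


def devices_needed(seq_lens, tokens_per_device, power_of_two_only):
    """Total devices occupied when each sequence gets its own parallel group."""
    total = 0
    for seq_len in seq_lens:
        degree = min_degree(seq_len, tokens_per_device)
        if power_of_two_only:
            degree = next_power_of_two(degree)
        total += degree
    return total


# Hypothetical batch with highly variable sequence lengths (in tokens).
batch = [1_000, 5_000, 9_000, 20_000]
capacity = 4_000  # assumed per-device token capacity

flexible = devices_needed(batch, capacity, power_of_two_only=False)
restricted = devices_needed(batch, capacity, power_of_two_only=True)
print(flexible, restricted)
```

In this toy setting the per-sequence degrees are 1, 2, 3, and 5 when any integer is allowed, and 1, 2, 4, and 8 under the power-of-two restriction, so the restricted configuration occupies more devices for the same batch. The real system additionally accounts for communication and compute costs, which this sketch ignores.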

Paper Structure (22 sections, 7 equations, 6 figures, 5 tables)

Introduction
Related Work
Preliminaries
Multimodal Large Language Models
Distributed Training Paradigms
Method
Problem Formulation
Cost Estimation
Polynomial-Time Problem Solving
Stage 2: Optimal Resource Assignment via 2D-Dynamic Programming
Overall Workflow and Implementation
Experiments
Experimental Setup
Evaluation
Time Consumption of Solver
...and 7 more sections

Figures (6)

Figure 1: Data distribution in MSRVTT, InternVid, and OpenVid.
Figure 2: Static Mesh vs. Dynamic Mesh.
Figure 3: Overall workflow of DHP.
Figure 4: Average iteration time (in seconds) comparison across MSRVTT, InternVid, and OpenVid datasets for InternVL3-2B/8B, InternVL2.5-4B and Qwen3VL-2B/4B/8B models, evaluated with Megatron-LM (diagonal stripes), DeepSpeed (dots), and DHP (vertical stripes). Acceleration ratios (e.g., 1.29x) are annotated above bars, highlighting DHP and DeepSpeed’s speedup over Megatron-LM.
Figure 5: Token throughput (in k tokens/s) comparison across different NPU counts (8, 16, 32, and 64) for three training methods: DHP, DeepSpeed (1x), and Megatron-LM. DHP consistently achieves the highest throughput across all NPU configurations and exhibits a slight upward trend as NPU count increases, while DeepSpeed (1x) and Megatron-LM deliver lower throughput with relatively flat scaling behavior.
...and 1 more figure
