DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism

Yifan Niu, Han Xiao, Dongyi Liu, Wei Zhou, Jia Li

TL;DR

This work proposes Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training and develops a polynomial-time algorithm to generate near-optimal parallelism strategies with only millisecond-level overhead per training batch.

Abstract

Scaling long-context capabilities is crucial for Multimodal Large Language Models (MLLMs). However, real-world multimodal datasets are extremely heterogeneous. Existing training frameworks predominantly rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Dynamic Hybrid Parallelism (DHP), an efficient parallelism strategy that adaptively reconfigures communication groups and parallelism degrees during MLLM training. We generalize to non-power-of-two parallelism degrees and develop a polynomial-time algorithm that generates near-optimal parallelism strategies with only millisecond-level overhead per training batch. DHP maintains high hardware efficiency even under extreme data variability. Experimental results demonstrate that DHP significantly outperforms Megatron-LM and DeepSpeed, achieving up to 1.36$\times$ speedup in training throughput while maintaining near-linear scaling efficiency across large-scale NPU clusters.
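
To make the load-imbalance claim concrete, the following is a minimal illustrative sketch, not the DHP algorithm. It assumes token count is a rough proxy for per-sample cost and compares a length-agnostic round-robin assignment with a longest-first greedy assignment. The function names and the sample lengths are hypothetical and do not come from the paper.

```python
import heapq

def static_assign(lengths, num_groups):
    """Round-robin assignment that ignores sample length (stand-in for a fixed layout)."""
    loads = [0] * num_groups
    for i, n in enumerate(lengths):
        loads[i % num_groups] += n
    return loads

def greedy_assign(lengths, num_groups):
    """Longest-first greedy: give each sample to the currently lightest group."""
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    for n in sorted(lengths, reverse=True):
        load, g = heapq.heappop(heap)
        heapq.heappush(heap, (load + n, g))
    # Return loads ordered by group index.
    return [load for load, _ in sorted(heap, key=lambda x: x[1])]

def imbalance(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))

# Hypothetical token counts for one batch: two long videos among short samples.
lengths = [32000, 1000, 2000, 1500, 30000, 800, 1200, 900]
static_loads = static_assign(lengths, 4)
greedy_loads = greedy_assign(lengths, 4)

print("static:", static_loads, round(imbalance(static_loads), 2))
print("greedy:", greedy_loads, round(imbalance(greedy_loads), 2))
```

In this toy batch the greedy assignment lowers the imbalance ratio, but it stays well above 1.0 because a single long sample cannot be split across groups. That residual gap is the kind of inefficiency that adapting parallelism degrees per batch is meant to address.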

Paper Structure (22 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Data distribution in MSRVTT, InternVid, and OpenVid.
  • Figure 2: Static Mesh vs. Dynamic Mesh.
  • Figure 3: Overall workflow of DHP.
  • Figure 4: Average iteration time (in seconds) comparison across MSRVTT, InternVid, and OpenVid datasets for InternVL3-2B/8B, InternVL2.5-4B and Qwen3VL-2B/4B/8B models, evaluated with Megatron-LM (diagonal stripes), DeepSpeed (dots), and DHP (vertical stripes). Acceleration ratios (e.g., 1.29x) are annotated above bars, highlighting DHP and DeepSpeed’s speedup over Megatron-LM.
  • Figure 5: Token throughput (in k tokens/s) comparison across different NPU counts (8, 16, 32, and 64) for three training methods: DHP, DeepSpeed (1x), and Megatron-LM. DHP consistently achieves the highest throughput across all NPU configurations and exhibits a slight upward trend as NPU count increases, while DeepSpeed (1x) and Megatron-LM deliver lower throughput with relatively flat scaling behavior.
  • ...and 1 more figure