Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Yijie Zheng, Bangjun Xiao, Lei Shi, Xiaoyang Li, Faming Wu, Tianyu Li, Xuefeng Xiao, Yang Zhang, Yuxuan Wang, Shouda Liu
TL;DR
OrchMLLM tackles the efficiency bottleneck in multimodal LLM training caused by Modality Composition Incoherence by introducing a post-balancing strategy. It pairs Batch Post-Balancing Dispatchers with an MLLM Global Orchestrator to rearrange mini-batches after sampling, achieving balanced GPU utilization across all training phases. The system employs a Node-wise All-to-All Communicator and ILP-based Node-wise Rearrangement to minimize inter-node communication, while the orchestrator ensures correct data assembly and overlaps computation with communication. Experiments on a $2560$-GPU cluster show an MFU of $41.6\%$ and up to $3.1\times$ throughput gains over Megatron-LM, with overhead kept below $2\%$ of forward time, demonstrating strong scalability and practical impact for large-scale MLLM training.
Abstract
Multimodal large language models (MLLMs), such as GPT-4o, are garnering significant attention. While exploring MLLM training, we identified Modality Composition Incoherence, a phenomenon in which the proportion of a given modality varies dramatically across examples. It makes mini-batch imbalances harder to address; these imbalances lead to uneven GPU utilization across Data Parallel (DP) instances and severely degrade the efficiency and scalability of MLLM training, ultimately slowing training and hindering further research on MLLMs. To address these challenges, we introduce OrchMLLM, a comprehensive framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. First, we propose the Batch Post-Balancing Dispatcher, a technique that efficiently eliminates mini-batch imbalances in sequential data. Additionally, we integrate the MLLM Global Orchestrator into the training framework to orchestrate multimodal data and tackle the issues arising from Modality Composition Incoherence. We evaluate OrchMLLM across various MLLM sizes, demonstrating its efficiency and scalability. Experimental results show that OrchMLLM achieves a Model FLOPs Utilization (MFU) of $41.6\%$ when training an 84B MLLM with three modalities on $2560$ H100 GPUs, outperforming Megatron-LM by up to $3.1\times$ in throughput.
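
To make the idea of rebalancing mini-batches after sampling concrete, the following is a minimal sketch, not the paper's dispatcher. It assumes each DP rank has already sampled a list of examples with known token counts, and it reassigns them with a greedy longest-first heuristic so that per-rank totals are closer to even. The function names `post_balance` and `loads` are hypothetical, and the sketch ignores per-modality encoder phases and inter-node communication cost, both of which the abstract says OrchMLLM accounts for.

```python
import heapq


def post_balance(lengths_per_rank):
    """Greedily reassign sampled examples across DP ranks (illustrative only).

    lengths_per_rank[r] holds the token counts of the examples that rank r
    originally sampled. Returns one list per rank of
    (source_rank, index_in_source, length) tuples.
    """
    num_ranks = len(lengths_per_rank)
    # Flatten all examples, remembering where each one came from.
    items = [
        (length, src, idx)
        for src, lens in enumerate(lengths_per_rank)
        for idx, length in enumerate(lens)
    ]
    items.sort(reverse=True)  # place the longest examples first

    # Min-heap of (current_load, rank): always fill the lightest rank next.
    heap = [(0, r) for r in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for length, src, idx in items:
        load, dst = heapq.heappop(heap)
        assignment[dst].append((src, idx, length))
        heapq.heappush(heap, (load + length, dst))
    return assignment


def loads(assignment):
    """Total token count assigned to each rank."""
    return [sum(length for _, _, length in rank) for rank in assignment]


if __name__ == "__main__":
    sampled = [[900, 800], [100, 50], [60, 40]]  # hypothetical token counts
    print("before:", [sum(lens) for lens in sampled])
    print("after: ", loads(post_balance(sampled)))
```

In this toy input, one rank initially holds most of the tokens, so the other ranks would idle while waiting for it; after reassignment the heaviest rank carries far less. A production system would additionally need to move the data between ranks and restore the original ordering for the loss computation, which this sketch does not attempt.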
