Efficient Distributed MLLM Training with Cornstarch

Insu Jang, Runyu Lu, Nikhil Bansal, Ang Chen, Mosharaf Chowdhury

TL;DR

The paper tackles the bottlenecks of distributed training for multimodal LLMs by identifying higher-order heterogeneities—frozen-status differences across modules and non-causal cross-modality attention—and introduces Cornstarch, a framework that combines frozen-status-aware pipeline parallelism with token workload-balanced context parallelism. It deploys multimodality-aware model parallelization, including encoder-parallel and encoder-colocated modality strategies, alongside memory-aware scheduling and bitfield attention to efficiently handle graph-like MLLM execution. Through ILP/LPT-based token distribution and intra-GPU subblock balancing, Cornstarch achieves substantial throughput gains, with an average improvement of 2.26x over strong baselines and up to 2.46x improvements from frozen-status-aware scheduling. The approach is validated on a multi-node GPU cluster with synthetic multimodal data, and the open-source release enables broader adoption for scalable MLLM training across diverse modalities.

Abstract

Multimodal large language models (MLLMs) extend the capabilities of large language models (LLMs) by combining heterogeneous model architectures to handle diverse modalities like images and audio. However, this inherent heterogeneity in MLLM model structure and data types makes makeshift extensions to existing LLM training frameworks unsuitable for efficient MLLM training. While a few works have attempted to address the heterogeneity in MLLM training, their approaches consider the characteristics of MLLMs only superficially. In this paper, we present Cornstarch, an efficient distributed MLLM training framework that accounts for the unique characteristics of MLLMs in both model and data parallelization. Cornstarch introduces frozen-aware pipeline parallelism and token workload-balanced context parallelism to improve MLLM training throughput. Our extensive evaluation shows that Cornstarch outperforms state-of-the-art solutions by $2.26\times$ on average in terms of MLLM training throughput. Cornstarch is an open-source project available at https://github.com/cornstarch-org/Cornstarch.
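
To make the idea of token workload balancing concrete, the following is a minimal sketch of the classic longest-processing-time-first (LPT) greedy heuristic, which the summary above names as one of the token distribution methods. It is not Cornstarch's implementation: the function name `lpt_assign`, the notion of a per-block "workload" number, and the example values are assumptions made for illustration only.

```python
import heapq


def lpt_assign(workloads, num_gpus):
    """Greedy LPT: hand each block, heaviest first, to the least-loaded GPU.

    `workloads` holds a hypothetical cost per token block (for example, how
    many tokens the block attends to); how that cost is estimated is outside
    the scope of this sketch.
    """
    # Min-heap of (current load, gpu index); the least-loaded GPU is on top.
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]

    # Visit block indices in order of decreasing workload.
    order = sorted(range(len(workloads)), key=lambda i: workloads[i], reverse=True)
    for idx in order:
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(idx)
        heapq.heappush(heap, (load + workloads[idx], gpu))

    loads = [sum(workloads[i] for i in bucket) for bucket in assignment]
    return assignment, loads


if __name__ == "__main__":
    # Five token blocks with uneven (made-up) attention workloads, two GPUs.
    assignment, loads = lpt_assign([7, 5, 4, 3, 1], num_gpus=2)
    print(assignment, loads)
```

LPT is a heuristic: it runs in O(n log n) time and usually lands close to the best achievable maximum load, whereas an ILP formulation can find an exact optimum at a higher solving cost.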

Paper Structure

This paper contains 28 sections, 2 equations, 13 figures, 5 tables, 2 algorithms.

Figures (13)

  • Figure 1: MLLM model architecture and dataflow.
  • Figure 2: Execution time of a VLM (SigLIP + Llama-3.2 1B) with different combinations of frozen status using pipeline parallelism on 4 NVIDIA A40 GPUs. The number of microbatches is 64. Optimal iteration time is computed based on the minimum pipeline bubble ratio [megatronturingnlg-arxiv22].
  • Figure 3: Balanced context parallelism optimized for LLMs. It is not applicable to MLLMs.
  • Figure 4: Parallelization of an MLLM with two modality encoders and an LLM under different modality parallelism strategies.
  • Figure 5: An example of computing $k_{\text{if}}$. We view the MLLM as a set of multiple sequential pipelines, each of which includes one modality encoder and the LLM. Note that this view is only for computing $k_{\text{if}}$. The actual pipeline schedule is graph-like as in Figure \ref{fig:model_parallelism_1f1b_schedule}.
  • ...and 8 more figures