PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline

Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song, Zeyu Mi, Yibo Zhu, Daxin Jiang, Yubin Xia, Haibo Chen

TL;DR

PipeWeaver tackles dynamic imbalance in large multimodal model (LMM) training with a dynamic interleaved pipeline that adapts modality-aware pipeline segments to the current data batches. Central to this approach are SEMU, a fast step emulator that uses spatial-temporal subgraph reuse to give accurate yet efficient performance estimates, and a hierarchical search that combines modality-module ranking, stage interleaving, and model-layer tuning. The system improves throughput by up to 97.3% over state-of-the-art baselines and maintains high hardware utilization across dynamic workloads, as validated through end-to-end experiments and large-scale simulations. These results suggest that diverse LMMs can be trained more efficiently on large GPU clusters, enabling faster iteration and broader deployment of multimodal capabilities.

Abstract

Large multimodal models (LMMs) have demonstrated excellent capabilities in both understanding and generation tasks across various modalities. While these models can accept flexible combinations of input data, their training efficiency suffers from two major issues: pipeline stage imbalance caused by heterogeneous model architectures, and training data dynamicity stemming from the diversity of multimodal data. In this paper, we present PipeWeaver, a dynamic pipeline scheduling framework designed for LMM training. The core of PipeWeaver is the dynamic interleaved pipeline, which searches for pipeline schedules tailored to the current training batches. PipeWeaver addresses these two issues with two techniques: adaptive modality-aware partitioning and efficient pipeline schedule search within a hierarchical schedule space. PipeWeaver also uses SEMU (Step Emulator), a training simulator for multimodal models, for accurate performance estimation; SEMU is accelerated by spatial-temporal subgraph reuse to improve search efficiency. Experiments show that PipeWeaver can improve LMM training efficiency by up to 97.3% compared to state-of-the-art systems, and that it adapts well to the data dynamicity of LMM training.
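
To make the interaction between stage imbalance and data dynamicity concrete, the following is a minimal toy sketch, not PipeWeaver's search algorithm and not SEMU. It assumes a hypothetical cost model (`layer_costs`) in which encoder layers scale with image tokens and backbone layers scale with all tokens, and it brute-forces a contiguous layer-to-stage split (`best_partition`) that minimizes the slowest stage. All function names, layer counts, and token counts are illustrative assumptions.

```python
from itertools import combinations


def layer_costs(image_tokens, text_tokens, enc_layers=4, llm_layers=8):
    """Hypothetical per-layer cost model (an assumption, not from the paper):
    encoder layers cost ~image tokens, backbone layers cost ~all tokens."""
    enc = [float(image_tokens)] * enc_layers
    llm = [float(image_tokens + text_tokens)] * llm_layers
    return enc + llm


def bottleneck(costs, cuts):
    """Cost of the slowest stage for a contiguous split at the given cuts."""
    bounds = (0, *cuts, len(costs))
    return max(sum(costs[a:b]) for a, b in zip(bounds, bounds[1:]))


def best_partition(costs, num_stages):
    """Brute-force the contiguous split that minimizes the slowest stage.
    Returns (bottleneck cost, cut positions)."""
    candidates = combinations(range(1, len(costs)), num_stages - 1)
    return min((bottleneck(costs, cuts), cuts) for cuts in candidates)


text_heavy = layer_costs(image_tokens=100, text_tokens=900)
image_heavy = layer_costs(image_tokens=900, text_tokens=100)

_, text_cuts = best_partition(text_heavy, num_stages=4)
best_cost, image_cuts = best_partition(image_heavy, num_stages=4)

# Reusing the text-heavy split on the image-heavy batch inflates the bottleneck.
stale_cost = bottleneck(image_heavy, text_cuts)
print(text_cuts, image_cuts, stale_cost, best_cost)
```

Under these assumed costs, the best split for a text-heavy batch differs from the best split for an image-heavy batch, and reusing the former on the latter yields a noticeably slower bottleneck stage. This is the kind of per-batch mismatch that motivates searching for schedules dynamically rather than fixing one partition up front.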

Paper Structure

This paper contains 28 sections, 1 equation, 14 figures, 6 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Comparison between CLIP-based models and LMMs.
  • Figure 2: An example of an LMM that uses a large language model as the backbone. The user prompt contains an image and a video clip, which the corresponding modality encoders convert into tokens. The backbone model then processes these tokens to produce the response as multimodal text or speech audio.
  • Figure 3: (a-b) The distribution of text tokens per image in the OBELICS [obelics-arxiv23], LAION-2B [laion5b-nips22], and ScienceQA [scienceqa-nips22] image datasets, and of text tokens per second in the ShareGPT4Video [share-gpt4-video-arxiv24], InternVid [internvid-arxiv23], and MMTrail-2M [mmtrail-arxiv24] video datasets. The Y-axis shows normalized data proportions. (c-d) Computational requirements across 100 data batches for VLM [llama3-arxiv24] and T2V models after data packing. The X-axis represents the data batch index sorted in ascending order by total computational cost, while the Y-axis indicates floating-point operations measured in TFLOPs.
  • Figure 4: Illustration of the impact of dynamic imbalance.
  • Figure 5: Overall workflow of PipeWeaver's training planner.
  • ...and 9 more figures