FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism

Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui

TL;DR

FlexSP tackles the inefficiency of fixed, homogeneous sequence parallelism (SP) in long-context LLM training by introducing heterogeneity-adaptive sequence parallelism, which forms multiple SP groups and assigns sequences to them according to their length-driven workload. The method formulates the assignment as a mixed-integer linear program (MILP) and keeps it tractable with sequence bucketing via dynamic programming. It is supported by a Sequence Blaster that creates memory-balanced micro-batches and by a solver-executor workflow that overlaps CPU solving with GPU execution. Key contributions include (i) a new system for varied-length corpora, (ii) an adaptive approach that matches heterogeneous parallelism to the data, and (iii) extensive experiments showing up to 1.98x speedup over SOTA systems across GPT-7B/13B/30B with long context. This approach makes long-context training more efficient on real-world datasets with long-tail sequence-length distributions, potentially broadening practical scalability for future LLMs.

Abstract

Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long-context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each input sequence across multiple devices and necessitates communication to process the sequence. In essence, existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences. However, in reality, the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution, which leads to workload heterogeneity. In this paper, we show that employing a single, static strategy results in inefficiency and resource under-utilization, highlighting the need for adaptive approaches to handle the heterogeneous workloads across sequences. To address this, we propose a heterogeneity-adaptive sequence parallelism method. For each training step, our approach captures the variability in sequence lengths and assigns the optimal combination of scattering strategies based on workload characteristics. We model this problem as a linear programming optimization and design an efficient and effective solver to find the optimal solution. Furthermore, we implement our method in a high-performance system that supports adaptive parallelization in distributed LLM training. Experimental results demonstrate that our system outperforms state-of-the-art training frameworks by up to 1.98x.
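
The sequence bucketing via dynamic programming mentioned in the TL;DR can be illustrated with a small, generic sketch. It is not FlexSP's implementation: the objective used here (minimizing the total gap between each sequence length and the upper bound of its bucket) and the function name `bucket_sequences` are assumptions chosen for illustration, since this page does not state the paper's exact bucketing objective.

```python
from typing import List, Tuple


def bucket_sequences(lengths: List[int], num_buckets: int) -> Tuple[int, List[int]]:
    """Group lengths into at most `num_buckets` contiguous buckets (after sorting).

    Assumed objective (hypothetical, for illustration): every sequence is
    represented by the largest length in its bucket, and we minimize the total
    rounding-up error. Returns (total_error, bucket_upper_bounds).
    """
    xs = sorted(lengths)
    n = len(xs)
    if n == 0:
        return 0, []
    k = min(num_buckets, n)  # more buckets than sequences is never useful

    # prefix[j] = sum of the first j sorted lengths
    prefix = [0]
    for x in xs:
        prefix.append(prefix[-1] + x)

    def cost(i: int, j: int) -> int:
        # Error of placing xs[i:j] in one bucket whose bound is xs[j - 1].
        return xs[j - 1] * (j - i) - (prefix[j] - prefix[i])

    inf = float("inf")
    # dp[b][j] = minimal error for the first j sequences using exactly b buckets
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                candidate = dp[b - 1][i] + cost(i, j)
                if candidate < dp[b][j]:
                    dp[b][j] = candidate
                    cut[b][j] = i

    # Walk the stored cut points backwards to recover the bucket upper bounds.
    bounds = []
    j = n
    for b in range(k, 0, -1):
        bounds.append(xs[j - 1])
        j = cut[b][j]
    bounds.reverse()
    return dp[k][n], bounds
```

With a long-tail input such as `[1, 2, 3, 100]` and two buckets, the sketch isolates the long sequence in its own bucket, which mirrors the intuition that a few very long sequences should not dictate how the many short ones are treated. The triple loop is O(k·n²), which is fine for a sketch but would need care at real batch sizes.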

Paper Structure

This paper contains 39 sections, 11 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: An example of heterogeneity-adaptive SP improving training efficiency for varied-length sequences.
  • Figure 2: Distribution of sequence lengths across different datasets. The height of each bar represents the percentage of sequences in the corresponding length range. Details of excessively long sequences are expanded into the right panel.
  • Figure 3: FlexSP system overview.
  • Figure 4: End-to-end evaluation (in seconds per iteration) for specific model sizes and maximum context lengths (Max Seq) across three datasets, shown in each sub-figure. Speedup ratios compared to DeepSpeed (green, left) and Megatron-LM (blue, right) are indicated.
  • Figure 5:
  • ...and 5 more figures