SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading

Qiaoling Chen, Shenggui Li, Wei Gao, Peng Sun, Yonggang Wen, Tianwei Zhang

TL;DR

Long-sequence LLM training is limited by extreme GPU memory and compute demands. The paper proposes Adaptive Sequence Pipeline Parallel Offloading (SPPO), which combines sequence-aware offloading and two-level activation management with an adaptive pipeline schedule built on a heuristic solver and multiplexed sequence partitioning. Empirical results show that SPPO achieves up to $3.38\times$ higher throughput than Megatron-LM and DeepSpeed and trains a 7B model with sequence lengths of up to $4M$ tokens on $128$ A100 GPUs. By partitioning sequences into subsequences and scheduling them adaptively, SPPO balances memory and computation and reduces the GPU resources needed for ultra-long sequences.

Abstract

In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities, driving advancements in real-world applications. However, training LLMs on increasingly long input sequences poses significant challenges due to high GPU memory and computational demands. Existing solutions face two key limitations: (1) memory reduction techniques, such as activation recomputation and CPU offloading, compromise training efficiency; (2) distributed parallelism strategies require excessive GPU resources, limiting the scalability of input sequence length. To address these gaps, we propose Adaptive Sequence Pipeline Parallel Offloading (SPPO), a novel LLM training framework that optimizes memory and computational resource efficiency for long-sequence training. SPPO introduces adaptive offloading, which uses sequence-aware offloading and two-level activation management to reduce GPU memory consumption without degrading training efficiency. Additionally, SPPO develops an adaptive pipeline scheduling approach with a heuristic solver and multiplexed sequence partitioning to improve computational resource efficiency. Experimental results demonstrate that SPPO achieves up to 3.38× throughput improvement over Megatron-LM and DeepSpeed, enabling efficient training of a 7B LLM with sequence lengths of up to 4M tokens on only 128 A100 GPUs.
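The abstract's "sequence-aware" offloading rests on the observation that subsequences of one long sequence do not cost the same to process. The sketch below is not SPPO's policy or solver; it only illustrates that observation under two assumptions that are not stated in the abstract: the model uses causal (decoder-only) attention, and the sequence is split into equal-size subsequences. The function name `causal_attention_pairs` is hypothetical.

```python
def causal_attention_pairs(seq_len: int, num_chunks: int) -> list[int]:
    """Count query-key pairs evaluated per chunk under a causal mask.

    Assumes equal-size chunks; a rough proxy for attention cost,
    not a FLOPs model of any particular system.
    """
    if num_chunks <= 0 or seq_len % num_chunks != 0:
        raise ValueError("seq_len must be a positive multiple of num_chunks")
    chunk = seq_len // num_chunks
    pairs = []
    for i in range(num_chunks):
        # Every query in chunk i attends to all keys in the i earlier chunks.
        earlier = chunk * (i * chunk)
        # Inside its own chunk, a query attends to itself and earlier positions.
        within = chunk * (chunk + 1) // 2
        pairs.append(earlier + within)
    return pairs


if __name__ == "__main__":
    # Later chunks see more keys, so their attention work grows linearly.
    for idx, count in enumerate(causal_attention_pairs(seq_len=8, num_chunks=4)):
        print(f"chunk {idx}: {count} query-key pairs")
```

Under these assumptions the last subsequence does several times the attention work of the first, which is one reason a uniform offloading ratio across subsequences would leave some transfers poorly overlapped with compute.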

Paper Structure

This paper contains 21 sections, 6 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Performance comparisons of SPPO and SOTA training systems on extremely long sequences of total 1B tokens, using 32, 64, and 128 GPUs, respectively.
  • Figure 2: The evolution of GPU computing power and PCIe.
  • Figure 3: Illustration of the pipeline parallelism scheduling with the increasing sequence length and our sequence pipeline parallel offloading scheduling.
  • Figure 4: Background & motivation. Imbalanced computation across subsequences with its FLOPs-based offloading policy and the following memory allocation in one step.
  • Figure 5: Activation memory allocation across subsequences when applying the optimal strategy of partitioning the sequence to 8 and 16 subsequences, respectively. The model is LLaMA-65B, and the sequence length is 128K.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Definition 1: Subsequence Identification Mapping
  • Definition 2: Inter-Stage Communication Scope