SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training

Yuliang Liu, Guohao Wu, Shenglong Zhang, Wei Zhang, Qianchao Zhu, Zhouyang Li, Chenyu Wang

TL;DR

This work tackles the challenge of efficient distributed training for LLMs under extreme context-length variance, where long-tail sequences create cascading workload imbalances and memory/communication bottlenecks. It introduces SlimPack, a slice-level packing framework that decomposes samples into MicroPacks and employs asymmetric partitioning to balance forward and backward passes, guided by a two-phase MILP-based solver and evaluated by a high-fidelity DAG-based simulator. The combination of DP-aware micro-packing, DP-Merge for ultra-long outliers, and zero-overhead runtime integration yields up to 2.8x throughput gains over strong baselines, with improved memory efficiency and reduced pipeline bubbles across multiple models and long-context datasets. These techniques enable more scalable, resource-efficient long-context LLM training, with practical impact on large-scale model development and deployment.

Abstract

The efficient distributed training of Large Language Models (LLMs) is severely hampered by the extreme variance in context lengths. This data heterogeneity, amplified by conventional packing strategies and asymmetric forward-backward costs, leads to critical inefficiencies such as cascading workload imbalances and severe hardware underutilization. Existing solutions attempt to mitigate these challenges, but often at the expense of memory or communication efficiency. To address these challenges, we introduce SlimPack, a framework that fundamentally rethinks data packing and scheduling by decomposing samples into fine-grained slices. This slice-level decomposition immediately mitigates critical memory and communication bottlenecks by transforming large, volatile workloads into a stream of smaller, manageable units. This flexibility is then harnessed for our core innovation, Asymmetric Partitioning, which assembles balanced scheduling units uniquely optimized for the different demands of the forward and backward passes. Orchestrated by a two-phase solver and a high-fidelity simulator, SlimPack holistically resolves imbalances across all parallel dimensions. Extensive experiments demonstrate that SlimPack achieves up to a $2.8\times$ training throughput improvement over baselines, breaking the conventional trade-off by delivering both superior balance and high resource efficiency.
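
The forward-backward asymmetry mentioned in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's cost model: it assumes a hypothetical unit-free cost in which attention scales with the square of a sample's length and GEMM scales linearly (`GEMM_COEFF` is an invented constant), and it takes the backward multipliers ($\times 2.5$ for attention, $\times 2$ for GEMM) from the Figure 3 caption. The lengths are chosen so that the two packs have identical forward cost.

```python
# Toy cost model (assumption, not the paper's): per-sample attention ~ length^2,
# GEMM ~ length. Backward multipliers follow the Figure 3 caption.
ATTN_BWD_FACTOR = 2.5
GEMM_BWD_FACTOR = 2.0
GEMM_COEFF = 1.0  # hypothetical weight of GEMM relative to attention


def forward_cost(length):
    """Forward cost of one sample: quadratic attention plus linear GEMM."""
    return float(length ** 2) + GEMM_COEFF * length


def backward_cost(length):
    """Backward cost: each forward component scaled by its own multiplier."""
    return ATTN_BWD_FACTOR * length ** 2 + GEMM_BWD_FACTOR * GEMM_COEFF * length


def pack_cost(lengths, cost_fn):
    """Samples in a pack do not attend to each other, so costs simply add."""
    return sum(cost_fn(n) for n in lengths)


long_pack = [5]               # one long sample
short_pack = [2, 2, 2, 2, 2]  # five short samples

fwd_long = pack_cost(long_pack, forward_cost)     # 30.0
fwd_short = pack_cost(short_pack, forward_cost)   # 30.0
bwd_long = pack_cost(long_pack, backward_cost)    # 72.5
bwd_short = pack_cost(short_pack, backward_cost)  # 70.0

print(f"forward:  long={fwd_long}, short={fwd_short}")
print(f"backward: long={bwd_long}, short={bwd_short}")
```

Under these assumptions the packs are perfectly balanced in the forward pass but diverge in the backward pass, because the long sample spends a larger share of its work in attention, which carries the larger multiplier. The gap in this toy case is small; its size depends entirely on the assumed constants and lengths.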

Paper Structure

This paper contains 31 sections, 4 equations, 17 figures, 3 tables, 1 algorithm.

Figures (17)

  • Figure 1: Distribution of sample lengths across pre-training datasets. The pronounced long-tailed pattern is a primary source of workload imbalance.
  • Figure 2: Illustration of the amplifier effect in hybrid parallelism. In a pure DP system (Left), minor workload variations between workers create a small, manageable imbalance. In a hybrid DP+PP system (Right), the strict micro-batch synchronization required by PP magnifies these minor delays. A single straggler (e.g., the "ultra long outlier" in DP0) forces subsequent pipeline stages to wait, creating a large cascading imbalance bubble and significant hardware idleness.
  • Figure 3: Imbalanced backward pass from a balanced forward pass. Despite a balanced forward pass achieved by packing shorter sequences (S2–S6) with a long sequence (S1), the backward pass exhibits significant imbalance. This is due to the higher computational cost of attention ($\times 2.5$) and GEMM ($\times 2$) operations during the backward pass.
  • Figure 4: MicroPack formation under a uniform FLOPs budget. Variable-length samples (S1–S9) are transformed into three formation states: Slim—consecutive slices from one long sample; Mix—leftover slices from a slimmed sample co-packed with complete short samples or additional slices (from other samples, order preserved) to meet the budget; Pack—all complete short samples. Note: The attention masks and slice areas are drawn solely to illustrate intra-sample slice dependencies; they are not to scale.
  • Figure 5: SlimPack's PP schedule for 16 samples across two pipelines (10 for DP1 and 6 for DP2), each using 8 MicroPacks. During backpropagation, slices are reallocated for asymmetric partitioning to rebalance workloads. For instance, sample 1 is split differently during the forward and backward passes to balance MicroPack loads. Decomposing a long sequence into slices transforms its high-variance compute cost into many low-variance chunks, so each pipeline stage processes near-homogeneous workloads, which prevents straggler-induced bubbles.
  • ...and 12 more figures
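
The three formation states named in the Figure 4 caption (Slim, Mix, Pack) can be sketched with a simple greedy slicer. This is an illustrative simplification, not SlimPack's solver: it uses a fixed token budget as a stand-in for the uniform FLOPs budget, walks the samples in their given order, and cuts any sample at a budget boundary. The function names `form_micropacks` and `classify` are hypothetical.

```python
def form_micropacks(lengths, budget):
    """Greedily fill fixed-size packs with (sample_id, start, end) slices.

    Assumption: `budget` counts tokens, used here as a proxy for FLOPs.
    """
    packs, current, room = [], [], budget
    for sample_id, length in enumerate(lengths):
        start = 0
        while start < length:
            take = min(room, length - start)
            current.append((sample_id, start, start + take))
            start += take
            room -= take
            if room == 0:  # pack is full: emit it and open a new one
                packs.append(current)
                current, room = [], budget
    if current:  # last pack may be under-filled
        packs.append(current)
    return packs


def classify(pack, lengths):
    """Label a pack with the formation state it resembles."""
    is_complete = [(end - start) == lengths[sid] for sid, start, end in pack]
    if all(is_complete):
        return "Pack"  # only whole samples
    if len({sid for sid, _, _ in pack}) == 1:
        return "Slim"  # slice(s) of a single long sample
    return "Mix"       # leftover slice co-packed with other samples


lengths = [10, 2, 2, 2, 2]  # one long sample followed by four short ones
packs = form_micropacks(lengths, budget=4)
kinds = [classify(p, lengths) for p in packs]

for pack, kind in zip(packs, kinds):
    print(kind, pack)
```

With these inputs the long sample fills two Slim packs, its leftover slice shares a Mix pack with one short sample, and the remaining short samples form Pack units. A FLOPs-based budget would differ from this token-based one because later slices of a causal sequence attend to more context and therefore cost more per token.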