SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training
Yuliang Liu, Guohao Wu, Shenglong Zhang, Wei Zhang, Qianchao Zhu, Zhouyang Li, Chenyu Wang
TL;DR
This work tackles the challenge of efficient distributed training for LLMs under extreme context-length variance, where long-tail sequences create cascading workload imbalances and memory/communication bottlenecks. It introduces SlimPack, a slice-level packing framework that decomposes samples into MicroPacks and employs asymmetric partitioning to balance forward and backward passes, guided by a two-phase MILP-based solver and evaluated by a high-fidelity DAG-based simulator. The combination of DP-aware micro-packing, DP-Merge for ultra-long outliers, and zero-overhead runtime integration yields up to 2.8x throughput gains over strong baselines, with improved memory efficiency and reduced pipeline bubbles across multiple models and long-context datasets. These techniques enable more scalable, resource-efficient long-context LLM training, with practical impact on large-scale model development and deployment.
Abstract
The efficient distributed training of Large Language Models (LLMs) is severely hampered by the extreme variance in context lengths. This data heterogeneity, amplified by conventional packing strategies and asymmetric forward-backward costs, leads to critical inefficiencies such as cascading workload imbalances and severe hardware underutilization. Existing solutions mitigate these imbalances, but often at the expense of memory or communication efficiency. We introduce SlimPack, a framework that fundamentally rethinks data packing and scheduling by decomposing samples into fine-grained slices. This slice-level decomposition directly mitigates critical memory and communication bottlenecks by transforming large, volatile workloads into a stream of smaller, manageable units. This flexibility is then harnessed for our core innovation, Asymmetric Partitioning, which assembles balanced scheduling units separately optimized for the different demands of the forward and backward passes. Orchestrated by a two-phase solver and a high-fidelity simulator, SlimPack resolves imbalances across all parallel dimensions. Extensive experiments demonstrate that SlimPack achieves up to a $2.8\times$ training throughput improvement over baselines, breaking the conventional trade-off by delivering both better workload balance and high resource efficiency.
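To make the imbalance concrete, the following is a minimal sketch, not SlimPack's actual cost model or solver. It assumes causal attention and uses the number of query-key pairs as a stand-in for attention work; the helper names `slice_sample`, `causal_pairs`, and `sample_cost` are hypothetical. It shows that two packs with the same token count can carry very different attention workloads, and that cutting a long sample into slices turns one large unit into several smaller ones whose costs add up to the original.

```python
def slice_sample(length, slice_len):
    """Split one sample of `length` tokens into contiguous (start, end) slices."""
    return [(s, min(s + slice_len, length)) for s in range(0, length, slice_len)]


def causal_pairs(start, end):
    """Query-key pairs for tokens [start, end) of a sample under causal attention.

    Token i (0-indexed) attends to i + 1 positions, so the total is
    sum_{i=start}^{end-1} (i + 1).
    """
    return (end * (end + 1) - start * (start + 1)) // 2


def sample_cost(length):
    """Attention cost proxy for a whole sample (assumed model, see lead-in)."""
    return causal_pairs(0, length)


# Two packs holding the same number of tokens (8192 each).
pack_long = [8192]        # one long-tail sample
pack_short = [1024] * 8   # eight short samples

cost_long = sum(sample_cost(n) for n in pack_long)
cost_short = sum(sample_cost(n) for n in pack_short)
print(f"equal tokens, cost ratio: {cost_long / cost_short:.2f}")

# Slicing the long sample yields smaller units; later slices cost more
# because they attend to all earlier tokens of the same sample.
slices = slice_sample(8192, 2048)
slice_costs = [causal_pairs(s, e) for s, e in slices]
print(slices)
print(slice_costs)
```

Under this proxy, token-count packing alone leaves the pack with the long sample with roughly eight times the attention work of the other. The uneven per-slice costs also suggest why slices have to be assembled into scheduling units by cost rather than by count.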
