Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM
Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, Ningyi Xu
TL;DR
This work tackles the efficiency bottlenecks of training long-context LLMs on mixed long- and short-context data by introducing Hierarchical Balance Packing (HBP), a multi-level data-packing framework with automatically selected packing groups and a dynamic training pipeline. To balance attention complexity and reduce communication overhead across groups, HBP combines GreedyFill-based data packing, adaptive sequential parallelism, curriculum learning, and an Ave-Token loss normalizer that stabilizes training. Empirical results show speedups of up to 2.4x (on a 236B MoE model) while maintaining performance on both general and long-context tasks across multiple model families, datasets, and scales. The approach generalizes across these settings and is practical for large-scale SFT of long-context LLMs in both industrial and research use.
Abstract
Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address these inefficiencies. In particular, HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequential parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequential parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up the training by 2.4$\times$ with competitive performance. Code will be released at https://github.com/ModelTC/HBP.
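To make the idea of multi-level packing groups concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified assignment rule (each sample goes to the smallest packing length that can hold it) and a plain first-fit-decreasing fill within each group. The function names `assign_groups`, `greedy_pack`, and `hierarchical_pack`, and the example packing lengths, are hypothetical. The real method additionally balances attention complexity and communication overhead when choosing groups, which this sketch does not model.

```python
from bisect import bisect_left


def assign_groups(lengths, pack_lengths):
    """Map each sample length to the smallest packing length that can hold it.

    Assumption: "smallest level that fits" stands in for the paper's
    group-assignment rule, which is not specified in the abstract.
    """
    levels = sorted(pack_lengths)
    groups = {level: [] for level in levels}
    for n in lengths:
        i = bisect_left(levels, n)  # first level with capacity >= n
        if i == len(levels):
            raise ValueError(f"sample of length {n} exceeds the largest packing length")
        groups[levels[i]].append(n)
    return groups


def greedy_pack(lengths, capacity):
    """First-fit-decreasing: put each sample into the first pack with room."""
    packs = []
    for n in sorted(lengths, reverse=True):
        for pack in packs:
            if sum(pack) + n <= capacity:
                pack.append(n)
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([n])
    return packs


def hierarchical_pack(lengths, pack_lengths):
    """Pack each group independently using its own packing length."""
    groups = assign_groups(lengths, pack_lengths)
    return {cap: greedy_pack(group, cap) for cap, group in groups.items()}


if __name__ == "__main__":
    # Hypothetical token counts and two hypothetical packing levels.
    sample_lengths = [100, 3000, 500, 9000, 30000, 1200, 700]
    for cap, packs in hierarchical_pack(sample_lengths, [4096, 32768]).items():
        print(cap, packs)
```

Under these assumptions, short samples are packed together at the small packing length while long samples stay in the large-length group, so each group could then be given its own sequential parallelism degree and gradient checkpointing setting, as the abstract describes.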
