Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Jiahao Hu, Shuo Wu, Yazhe Niu, Ruihao Gong, Dahua Lin, Ningyi Xu

TL;DR

This work tackles the efficiency bottlenecks of training long-context LLMs on mixed long and short data by introducing Hierarchical Balance Packing (HBP), a multi-level data-packing framework with automatically selected packing groups and a dynamic training pipeline. To balance attention complexity and communication overhead across groups, HBP combines GreedyFill-based data packing, adaptive sequence parallelism, curriculum learning, and an Ave-Token loss normalizer that stabilizes training. Empirical results show substantial speedups (up to 2.4× on a 236B MoE model) while maintaining performance on both general and long-context tasks across multiple model families, datasets, and scales. The approach targets large-scale SFT of long-context LLMs in industrial and research settings.

Abstract

Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue, but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies. In particular, HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequence parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequence parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up the training by 2.4$\times$ with competitive performance. Code will be released at https://github.com/ModelTC/HBP.
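
To make the "imbalanced attention computation" point concrete, the sketch below packs a toy set of sample lengths in two ways and compares a rough attention-cost proxy per pack. It is an illustration under stated assumptions, not the paper's implementation: `greedy_pack` is a generic first-fit-decreasing heuristic standing in for GreedyFill, `attention_cost` assumes attention is computed per sample (cost on the order of the squared length), and `hierarchical_pack` uses a hypothetical rule that sends each sample to the smallest packing length that fits it. All function names and the example lengths are made up for this sketch.

```python
from typing import Dict, List


def attention_cost(pack: List[int]) -> int:
    """Proxy for attention work in one pack: sum of squared sample lengths.

    Assumption: samples in a pack do not attend to each other, so a sample
    of length n contributes on the order of n**2.
    """
    return sum(n * n for n in pack)


def greedy_pack(lengths: List[int], pack_len: int) -> List[List[int]]:
    """First-fit-decreasing packing (generic heuristic, not the paper's GreedyFill)."""
    packs: List[List[int]] = []
    for n in sorted(lengths, reverse=True):
        if n > pack_len:
            raise ValueError(f"sample of length {n} exceeds pack length {pack_len}")
        for pack in packs:
            if sum(pack) + n <= pack_len:
                pack.append(n)  # fits in an existing pack
                break
        else:
            packs.append([n])  # no existing pack has room: open a new one
    return packs


def hierarchical_pack(lengths: List[int], pack_lens: List[int]) -> Dict[int, List[List[int]]]:
    """Group samples by the smallest packing length that fits, then pack each group.

    Assumption: this assignment rule is a simplification chosen for the sketch.
    """
    levels = sorted(pack_lens)
    groups: Dict[int, List[int]] = {p: [] for p in levels}
    for n in lengths:
        target = next((p for p in levels if n <= p), None)
        if target is None:
            raise ValueError(f"sample of length {n} exceeds every pack length")
        groups[target].append(n)
    return {p: greedy_pack(g, p) for p, g in groups.items()}


# Toy data: one long, two medium, and eight short samples.
lengths = [8192] + [1024] * 8 + [4096] * 2

# Single-level packing: every pack holds 8192 tokens, but the cost proxy differs.
naive = greedy_pack(lengths, 8192)
naive_costs = [attention_cost(p) for p in naive]

# Multi-level packing: packs inside each group have matching cost.
hier = hierarchical_pack(lengths, [1024, 4096, 8192])
hier_costs = {p: [attention_cost(x) for x in packs] for p, packs in hier.items()}
```

With these toy lengths, single-level packing yields three packs of 8192 tokens each whose cost proxy spans a factor of 8, while each multi-level group contains packs of identical cost. This only illustrates why equal token counts do not imply equal attention work; it says nothing about the communication side, which the paper measures separately.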

Paper Structure

This paper contains 42 sections, 16 equations, 11 figures, 23 tables, 7 algorithms.

Figures (11)

  • Figure 1: Difference between naive packing and hierarchical balance packing. Short, medium, and long represent samples of different lengths, and SP Comm refers to the additional communication overhead introduced by enabling sequence parallel (SP) training. ABR (Attention Balance Ratio) measures imbalanced attention computation, and CR (Communication Ratio) measures additional communication overhead; both are described in the section on metrics.
  • Figure 2: Hierarchical Balance Packing training framework.
  • Figure 3: Example of optimizing packing group for communication.
  • Figure 4: Loss with and without curriculum learning (CL).
  • Figure 5: Grad Norm of Sum loss and Average Token loss.
  • ...and 6 more figures