Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning

Shuhe Wang, Guoyin Wang, Yizhong Wang, Jiwei Li, Eduard Hovy, Chen Guo

TL;DR

This paper performs extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M and models from 8B to 70B, and provides the first comprehensive analysis of the advantages and limitations of packing versus padding.

Abstract

Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model's maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable size of the model and dataset for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M and models from 8B to 70B. This provides the first comprehensive analysis of the advantages and limitations of packing versus padding, as well as practical considerations for implementing packing in various training scenarios. Our analysis covers various benchmarks, including knowledge, reasoning, and coding, as well as GPT-based evaluations, time efficiency, and other fine-tuning parameters. We also open-source our code for fine-tuning and evaluation and provide checkpoints fine-tuned on datasets of different sizes, aiming to advance future research on packing methods. Code is available at: https://github.com/ShuheWang1998/Packing-Analysis?tab=readme-ov-file.

Paper Structure

This paper contains 40 sections, 9 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: An illustration of the padding and packing methods. (1) Padding: each training sample is extended with the special token “[PAD]” until it reaches the model's input length. (2) Greedy packing: training samples are packed together as densely as possible according to their lengths. (3) Random packing: all training samples are first concatenated into one single sequence, which is then cut into shorter training sequences according to the model's maximum input length. Note that random packing can split a single conversation across two different sequences, as illustrated by the conversation (instruction 2, answer 2) at the bottom of the figure.
  • Figure 2: The results of fine-tuning the LLaMA-3-8B model on the TULU dataset using different linear combinations of batch size and learning rate.
  • Figure 3: The results of fine-tuning the LLaMA-3-8B model by varying the ratio of multi-turn conversations and single-turn conversations.
  • Figure :
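
The three strategies described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, the `PAD` token id, and the first-fit-decreasing heuristic used for greedy packing are all assumptions made here for clarity, and samples are represented as plain lists of token ids.

```python
from typing import List

PAD = 0  # hypothetical pad token id, chosen only for this sketch


def pad_samples(samples: List[List[int]], max_len: int) -> List[List[int]]:
    """Padding: one sample per sequence, truncated to max_len and filled with PAD."""
    out = []
    for s in samples:
        s = s[:max_len]
        out.append(s + [PAD] * (max_len - len(s)))
    return out


def random_pack(samples: List[List[int]], max_len: int) -> List[List[int]]:
    """Random packing: concatenate all samples, then cut into max_len chunks.

    A sample may be split across two consecutive chunks.
    """
    stream = [tok for s in samples for tok in s]
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


def greedy_pack(samples: List[List[int]], max_len: int) -> List[List[int]]:
    """Greedy packing (assumed first-fit-decreasing): samples are kept whole."""
    bins: List[List[int]] = []
    for s in sorted(samples, key=len, reverse=True):
        s = s[:max_len]
        for b in bins:
            # Place the sample in the first sequence that still has room.
            if len(b) + len(s) <= max_len:
                b.extend(s)
                break
        else:
            # No existing sequence has room, so start a new one.
            bins.append(list(s))
    return bins
```

With four samples of lengths 3, 5, 2, and 4 and a maximum length of 6, padding produces four sequences, while both packing variants produce three. Random packing cuts two of the samples at chunk boundaries, whereas the greedy variant keeps every sample intact. A real SFT pipeline would additionally need to handle attention masks and loss masking, which this sketch omits.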