Refining Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models

Yanbing Chen, Ruilin Wang, Zihao Yang, Lavender Yao Jiang, Eric Karl Oermann

TL;DR

This work analyzes how data packing strategies and the choice of atom size affect language model training. By pretraining GPT-2 124M on WikiText with two packing methods (concatenation, or "concat", and padding) across multiple atom sizes and maximum sequence length (MSL) values, and keeping the parameter count fixed via ALiBi, the authors evaluate final perplexity, perplexity ranking, and efficiency. They find that matching the atom size to MSL optimizes performance for both packing methods, with padding delivering lower final perplexity but requiring more training steps and yielding lower compute efficiency. The results provide practical guidance for selecting a packing strategy based on data availability and training time, highlighting a performance-efficiency trade-off that researchers can leverage in LM pretraining.

Abstract

Packing and shuffling tokens is a common practice in training auto-regressive language models (LMs) to prevent overfitting and improve efficiency. Typically, documents are concatenated into chunks of maximum sequence length (MSL) and then shuffled. However, setting the atom size (the length of each data chunk that is randomly shuffled as a unit) to MSL may lead to contextual incoherence, because tokens from different documents are packed into the same chunk. An alternative is padding, another common data packing strategy, which avoids contextual incoherence by including only one document in each shuffled chunk. To optimize both packing strategies (concatenation vs. padding), we investigated the optimal atom size for shuffling and compared the two methods' performance and efficiency. We found that matching atom size to MSL optimizes performance for both packing methods, and that padding yields lower final perplexity (higher performance) than concatenation at the cost of more training steps and lower compute efficiency. This trade-off informs the choice of packing method when training language models.
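
The difference between the two packing strategies can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `pack_concat` and `pack_padding`, the `PAD_ID` value, the toy token IDs, and the choice to drop an incomplete final sequence are all assumptions made for illustration, and document separator tokens are omitted for brevity.

```python
import random

PAD_ID = -1  # hypothetical padding token id, chosen only for this sketch


def pack_concat(docs, atom_size, msl, seed=0):
    """Concatenate documents into one stream, shuffle it in atoms of
    `atom_size` tokens, then cut the result into sequences of `msl` tokens.
    A sequence may therefore mix tokens from different documents."""
    stream = [tok for doc in docs for tok in doc]
    atoms = [stream[i:i + atom_size] for i in range(0, len(stream), atom_size)]
    random.Random(seed).shuffle(atoms)
    shuffled = [tok for atom in atoms for tok in atom]
    # Assumption: an incomplete final sequence is dropped.
    return [shuffled[i:i + msl] for i in range(0, len(shuffled) - msl + 1, msl)]


def pack_padding(docs, msl, seed=0):
    """Split each document separately into chunks of at most `msl` tokens
    (atom size equal to MSL) and pad the short tail, so that no sequence
    contains tokens from more than one document."""
    seqs = []
    for doc in docs:
        for i in range(0, len(doc), msl):
            chunk = doc[i:i + msl]
            seqs.append(chunk + [PAD_ID] * (msl - len(chunk)))
    random.Random(seed).shuffle(seqs)
    return seqs


# Toy corpus: three "documents" of token ids.
docs = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
concat_seqs = pack_concat(docs, atom_size=4, msl=4)
padded_seqs = pack_padding(docs, msl=4)
print(len(concat_seqs), len(padded_seqs))
```

On this toy corpus, padding produces more sequences than concatenation for the same data, because part of each padded sequence is spent on padding tokens. That is the intuition behind the trade-off described above: padding keeps each sequence within a single document but needs more training steps per epoch.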

Paper Structure (26 sections, 2 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Comparison of concat models with different atom sizes when MSL is 64. Atom sizes smaller or larger than 1 MSL increase perplexity. The model with an atom size of 1 MSL has the lowest final perplexity at the end of 2 epochs, indicating the best performance.
  • Figure 2: Comparison of padding models with different atom sizes when MSL is 64. Atom sizes smaller or larger than 1 MSL increase perplexity. The model with an atom size of 1 MSL has the lowest final perplexity at the end of 2 epochs, indicating the best performance.
  • Figure 3: Step-wise comparison of perplexity between padding and concat models under different MSLs (the first 2,000 steps are discarded due to high perplexity). Padding (orange) reaches lower final perplexity (better performance), while concat (blue) requires fewer training steps over 2 epochs.
  • Figure 4: Distribution of tokenized sequence lengths in WikiText-103-raw, based on 10,000 random samples. The dataset consists mostly of short paragraphs of 0 to 200 tokens.
  • Figure 5: Illustration of the packing steps for padding when MSL is 32 and atom size is 64. The "tail" subsequence contains fewer tokens than the specified atom size and is padded to meet the MSL requirement, ensuring consistency in sequence length.
  • ...and 5 more figures