Analysing The Impact of Sequence Composition on Language Model Pre-Training

Yu Zhao; Yuanbin Qu; Konrad Staniszewski; Szymon Tworkowski; Wei Liu; Piotr Miłoś; Yuxiang Wu; Pasquale Minervini

TL;DR

This paper analyses how pre-training sequence composition and masking influence language model generalisation. It systematically compares the MixChunk, UniChunk, and retrieval-based Bm25Chunk packing strategies under causal masking and intra-document causal masking, formalising how chunks are constructed and which tokens each prediction is conditioned on. The key findings are that, under causal masking, conditioning on earlier documents in the same sequence can introduce distracting information, while intra-document causal masking removes these distractions and improves performance on language modelling and downstream tasks; Bm25Chunk further improves in-context learning, knowledge memorisation, and context utilisation without sacrificing efficiency. These insights offer practical guidance for resource-constrained pre-training, and the analysis uses measures such as distraction and burstiness to link data ordering with model capabilities.

Abstract

Most language model pre-training frameworks concatenate multiple documents into fixed-length sequences and use causal masking to compute the likelihood of each token given its context; this strategy is widely adopted due to its simplicity and efficiency. However, to this day, the influence of the pre-training sequence composition strategy on the generalisation properties of the model remains under-explored. In this work, we find that applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training, which negatively impacts the performance of the models on language modelling and downstream tasks. In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document, eliminating potential distracting information from previous documents and significantly improving performance. Furthermore, we find that concatenating related documents can reduce some potential distractions during pre-training, and our proposed efficient retrieval-based sequence construction method, BM25Chunk, can improve in-context learning (+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%) abilities of language models without sacrificing efficiency.
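
The difference between the two masking strategies described above can be illustrated with a small sketch. This is not the authors' implementation: the function names and the `doc_ids` representation (one document index per token in the packed sequence) are assumptions made for illustration only.

```python
# Minimal sketch of causal vs. intra-document causal masking.
# Assumption: each token in a packed sequence carries the index of the
# document it came from (doc_ids); names here are hypothetical.

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def intra_document_causal_mask(doc_ids):
    """Causal mask that additionally blocks attention across document boundaries."""
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# A packed sequence of 5 tokens: 3 from document 0, then 2 from document 1.
doc_ids = [0, 0, 0, 1, 1]
plain = causal_mask(len(doc_ids))
intra = intra_document_causal_mask(doc_ids)

# Under plain causal masking, the first token of document 1 can see document 0.
# Under intra-document causal masking, it can only see itself.
print(plain[3])
print(intra[3])
```

Under plain causal masking, the fourth token conditions on all three tokens of the preceding document; with the intra-document variant, those positions are masked out, which is the mechanism the abstract credits with removing distracting context.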

Paper Structure (45 sections, 7 equations, 7 figures, 10 tables, 1 algorithm)

Introduction
Packing and Masking Strategies for Pre-Training Sequence Composition
Packing Strategies
MixChunk
UniChunk
Bm25Chunk
Masking Strategies
Causal Masking
Intra-Document Causal Masking
Language Model Pre-Training
Settings
Pre-Training Corpora
Pre-Training Models
Results
Experiments on Downstream Tasks
...and 30 more sections

Figures (7)

Figure 1: Packing strategies for pre-training chunks construction. (a) illustrates the compositions of MixChunk and UniChunk; (b) presents the sequence construction process of Bm25Chunk.
Figure 2: Average in-context learning accuracy using different numbers of few-shot demonstrations -- the left and right figures show the results of 2K and 8K models.
Figure 3: Accuracy on Multi-Document Question-Answering (MDQA). The $x$-axis represents the position of the document that contains the answer. The $y$-axis presents the accuracy for a position.
Figure 4: Distracted attention proportions of models. The $x$-axis presents the token position within the second document; the $y$-axis presents the distraction proportion calculated by \ref{eq:distraction}. Figures (a) and (b) show the distraction proportion of the first and last layers. Figures (c) and (d) are the average distraction proportion over layers. In Figure (d), we separate documents by a newline token ("\n") and present the distraction proportion of IntraDoc. The results are averaged over $4096$ examples. More analysis is presented in \ref{sec:more_distraction}.
Figure 5: Pre-training sequence construction speeds using different buffer sizes $k$ and maximum query lengths $q$. Tested on $16$ CPU cores.
...and 2 more figures
