Analysing The Impact of Sequence Composition on Language Model Pre-Training
Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, Yuxiang Wu, Pasquale Minervini
TL;DR
This paper analyses how pre-training sequence composition and masking influence language model generalisation. It systematically compares three packing strategies, MixChunk, UniChunk, and the retrieval-based BM25Chunk, under causal masking and intra-document causal masking, formalising how chunks are constructed and which tokens each prediction is conditioned on. The key findings are that, under causal masking, tokens from unrelated preceding documents in the same sequence can act as distractions during training, whereas intra-document causal masking removes these distractions and improves performance on language modelling and downstream tasks. BM25Chunk further improves in-context learning, knowledge memorisation, and context utilisation without sacrificing efficiency. These insights offer practical guidance for resource-constrained pre-training, and the analysis uses measures such as distraction and burstiness to link data ordering with model capabilities.
Abstract
Most language model pre-training frameworks concatenate multiple documents into fixed-length sequences and use causal masking to compute the likelihood of each token given its context; this strategy is widely adopted due to its simplicity and efficiency. However, to this day, the influence of the pre-training sequence composition strategy on the generalisation properties of the model remains under-explored. In this work, we find that applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training, which negatively impacts the performance of the models on language modelling and downstream tasks. In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document, eliminating potential distracting information from previous documents and significantly improving performance. Furthermore, we find that concatenating related documents can reduce some potential distractions during pre-training, and our proposed efficient retrieval-based sequence construction method, BM25Chunk, can improve in-context learning (+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%) abilities of language models without sacrificing efficiency.
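The difference between the two masking schemes described above can be illustrated with a small sketch. This is not the authors' implementation; the function names and the toy document lengths are assumptions made for illustration. It builds boolean attention masks for a packed sequence, where entry (i, j) says whether token i may condition on token j.

```python
from typing import List


def document_ids(doc_lengths: List[int]) -> List[int]:
    """Map each token position in a packed sequence to the index of its document."""
    ids: List[int] = []
    for doc_index, length in enumerate(doc_lengths):
        ids.extend([doc_index] * length)
    return ids


def causal_mask(seq_len: int) -> List[List[bool]]:
    """Standard causal mask: token i may attend to every token j with j <= i,
    including tokens that belong to earlier documents in the same sequence."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def intra_document_causal_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Intra-document causal mask: token i may attend to token j only if
    j <= i and both tokens come from the same document."""
    ids = document_ids(doc_lengths)
    n = len(ids)
    return [[j <= i and ids[i] == ids[j] for j in range(n)] for i in range(n)]


# Hypothetical packed sequence: a 3-token document followed by a 2-token document.
doc_lengths = [3, 2]
plain = causal_mask(sum(doc_lengths))
intra = intra_document_causal_mask(doc_lengths)

# Print the intra-document mask as rows of 0/1 for inspection.
for row in intra:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this toy example, the first token of the second document (position 3) can see all three tokens of the first document under the plain causal mask, but only itself under the intra-document mask. In practice such a mask would be passed to the attention layer of the model; how that is done depends on the training framework and is not specified here.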
