Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness
Lars Hillebrand, Prabhupad Pradhan, Christian Bauckhage, Rafet Sifa
TL;DR
This work introduces pointer-guided segment ordering (SO), a pre-training objective that uses a self-attention pointer network to restore the original order of shuffled text segments, thereby enriching paragraph-level contextual representations in large language models when combined with masked language modeling (MLM). It couples SO with a dynamic sampling strategy during fine-tuning to maximize context utilization and improve sample efficiency, particularly for long documents. The method is implemented on encoder-based architectures (e.g., BERT/RoBERTa variants) and evaluated on diverse scientific and financial datasets, achieving state-of-the-art or competitive results in sequential text classification tasks. Limitations include a 512-token context window and absolute positional embeddings, with future work targeting longer contexts, relative positional encodings, and retrieval/semantic search applications.
Abstract
We introduce "pointer-guided segment ordering" (SO), a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations in large language models. Our methodology leverages a self-attention-driven pointer network to restore the original sequence of shuffled text segments, addressing the challenge of capturing the structural coherence and contextual dependencies within documents. This pre-training approach is complemented by a fine-tuning methodology that incorporates dynamic sampling, augmenting the diversity of training instances and improving sample efficiency for various downstream applications. We evaluate our method on a diverse set of datasets, demonstrating its efficacy in tasks requiring sequential text classification across scientific literature and financial reporting domains. Our experiments show that pointer-guided pre-training significantly enhances the model's ability to understand complex document structures, leading to state-of-the-art performance in downstream classification tasks.
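As a rough illustration of the segment ordering objective described above, the following minimal sketch builds one training instance by shuffling segments and deriving the index each original position should point to in the shuffled input. The function names (`shuffle_segments`, `pointer_targets`, `restore`, `pointer_cross_entropy`) are hypothetical, and the cross-entropy over a row-wise softmax of pointer scores is an assumed loss form for illustration, not a confirmed detail of the authors' implementation.

```python
import math
import random


def shuffle_segments(segments, seed=0):
    """Shuffle segments; order[j] is the original index of the j-th shuffled segment."""
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)
    shuffled = [segments[i] for i in order]
    return shuffled, order


def pointer_targets(order):
    """For each original position k, return the shuffled position that holds segment k."""
    targets = [0] * len(order)
    for shuffled_pos, original_pos in enumerate(order):
        targets[original_pos] = shuffled_pos
    return targets


def restore(shuffled, targets):
    """Follow the pointers to recover the original segment order."""
    return [shuffled[t] for t in targets]


def pointer_cross_entropy(scores, targets):
    """Assumed loss: mean cross-entropy of a row-wise softmax over pointer scores.

    scores[k][j] is a (hypothetical) model score that original position k
    points to shuffled position j.
    """
    total = 0.0
    for row, target in zip(scores, targets):
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_z - row[target]
    return total / len(targets)
```

In this sketch, a model that scores every shuffled position equally incurs a loss of log(n) for n segments, which gives a simple baseline to compare against when the pointer scores come from a trained encoder.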
