Structured Packing in LLM Training Improves Long Context Utilization

Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, Yu Zhao, Henryk Michalewski, Łukasz Kuciński, Piotr Miłoś

TL;DR

This work tackles suboptimal long-context utilization in large language models by proposing Structured Packing for Long Context (SPLiCe), a retrieval-based data organization method that concatenates mutually relevant documents into single training samples. Using breadth-first retrieval with BM25 or Contriever (plus repository-aware variants for code), SPLiCe increases the density of dependencies within long contexts and guides the model to use distant information. Empirical results across 3B, 7B, and 13B models show that SPLiCe improves performance on long-context tasks (Qasper, HotpotQA, Needle in a Haystack) with only modest fine-tuning, while preserving or improving results on short-context benchmarks. The study also reports cross-domain transfer (training on code improves natural language tasks), robustness to noisy retrievers, and data properties, such as increased burstiness, that are associated with improved in-context learning. Overall, SPLiCe provides a practical, scalable approach to enhancing long-context capabilities across diverse data sources and modalities.

Abstract

Recent advancements in long-context large language models have attracted significant attention, yet their practical applications often suffer from suboptimal context utilization. This study investigates structuring training data to enhance semantic interdependence, demonstrating that this approach effectively improves context utilization. To this end, we introduce the Structured Packing for Long Context (SPLiCe) method, which utilizes retrieval to collate mutually relevant documents into long and coherent training examples. We validate SPLiCe empirically across models of varying sizes -- 3B, 7B, and 13B -- achieving improved performance in long-context tasks, such as Qasper and HotpotQA. Remarkably, even brief fine-tuning with SPLiCe is sufficient to realize these benefits. Additionally, SPLiCe effectively mitigates the lost-in-the-middle phenomenon often observed in large models. Our comprehensive analysis of SPLiCe explores its design choices and reveals intriguing transfer effects; for instance, training on programming code enhances performance on natural language tasks.
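
The packing idea described above (start from one document, retrieve its most related documents, and keep expanding breadth-first until the example is full) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy Jaccard word-overlap scorer in place of BM25 or Contriever, and it caps each example by a document count (`max_docs`) rather than by a token budget. The names `retrieve`, `build_example`, `pack_corpus`, `k`, and `max_docs` are hypothetical and chosen only for this illustration.

```python
from collections import deque
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query_id, docs, candidates, k):
    """Return up to k candidate ids ranked by Jaccard overlap with the query.

    Assumption: this toy scorer replaces BM25/Contriever for illustration only.
    """
    q = tokenize(docs[query_id])
    scored = []
    for cid in candidates:
        c = tokenize(docs[cid])
        union = q | c
        score = len(q & c) / len(union) if union else 0.0
        scored.append((score, cid))
    # Highest score first; ties broken by id so the result is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored[:k]]


def build_example(root_id, docs, unused, k=2, max_docs=4):
    """Grow one training example breadth-first from root_id.

    `unused` is mutated: every document placed in the example is removed,
    so each document appears in at most one example.
    """
    unused.discard(root_id)
    example = [root_id]
    queue = deque([root_id])
    while queue and len(example) < max_docs and unused:
        current = queue.popleft()
        for neighbor in retrieve(current, docs, unused, k):
            if len(example) >= max_docs:
                break
            unused.discard(neighbor)
            example.append(neighbor)
            queue.append(neighbor)  # expand this document's neighbors later
    return example


def pack_corpus(docs, k=2, max_docs=4):
    """Partition all documents into examples of related documents."""
    unused = set(range(len(docs)))
    examples = []
    while unused:
        root = min(unused)  # deterministic here; a real pipeline might sample
        examples.append(build_example(root, docs, unused, k, max_docs))
    return examples


docs = [
    "python list sort function",
    "stock market prices fall",
    "python sort list of tuples",
    "market prices rise on stock news",
]
print(pack_corpus(docs, k=1, max_docs=2))  # [[0, 2], [1, 3]]
```

In this toy corpus the two Python documents land in one example and the two market documents in the other, whereas packing in corpus order would pair unrelated documents. A realistic pipeline would swap in a proper retriever and bound each example by context length in tokens.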

Paper Structure

This paper contains 36 sections, 1 equation, 18 figures, 33 tables, 1 algorithm.

Figures (18)

  • Figure 1: SPLiCe vs. the Example Packing (EP) baseline on Needle in a Haystack. A model fine-tuned with SPLiCe achieves perfect accuracy in retrieving fine-grained information over the whole context, while the baseline can only handle a small final segment (details in Appendix N).
  • Figure 2: Training samples generated by Example Packing, Within-Domain Example packing, and SPLiCe. Similar colors and shapes indicate related documents, which could be found using a retrieval method (e.g., BM25 or Contriever) or metadata (e.g., git repository structure).
  • Figure 3: Key-value retrieval performance on a dictionary of $300$ key-value pairs ($\approx 24$K tokens). The $7$B CL model trained with SPLiCe achieves much higher accuracy on hard-to-retrieve positions in the middle than the Example Packing baseline. Details of this task can be found in Appendix D. Each position is averaged over $500$ examples.
  • Figure 4:
  • Figure 5:
  • ...and 13 more figures