Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model

Chaochen Gao, Xing Wu, Qi Fu, Songlin Hu

TL;DR

Quest tackles the challenge of long-context learning by introducing a query-centric data synthesis framework that balances semantic relevance and context diversity. It predicts multiple potential queries per document, groups documents by shared queries and keywords, and concatenates diverse yet relevant documents to form long-context training data. Across 32k–128k contexts and up to 1M tokens, Quest consistently outperforms standard and similarity-based synthesis methods, and scales effectively to state-of-the-art models like LLaMA3-8B-128k, achieving top open-source performance and approaching GPT-4 Turbo levels on ultra-long tasks. The work also establishes a measurable scaling law for synthesized long-context data and demonstrates improved domain coverage, robust short-context performance, and broad applicability to various model sizes.

Abstract

Recent advancements in large language models (LLMs) have highlighted the importance of extending context lengths for handling complex tasks. While traditional methods for training on long contexts often use filtered long documents, these approaches lead to domain imbalances, limiting model performance. To address this, techniques like random document concatenation (Standard) and similarity-based methods (KNN, ICLM) have been developed. However, they sacrifice either semantic coherence or diversity. To balance both aspects, we introduce Quest, a query-centric data synthesis method aggregating semantically relevant yet diverse documents. Quest uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords. Extensive experiments demonstrate Quest's superior performance on long-context tasks, achieving remarkable results with context lengths of up to 1M tokens and confirming its scalability across various model sizes.
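
The grouping step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the predicted queries are assumed to be supplied already (the generative query predictor is out of scope here), and `longest_token`, `build_contexts`, and whitespace token counting are hypothetical stand-ins for the keyword extractor and tokenizer used in the paper.

```python
from collections import defaultdict


def longest_token(query):
    """Toy keyword extractor (an assumption): pick the longest token of a query."""
    tokens = [t.strip(".,?!").lower() for t in query.split()]
    return max(tokens, key=len)


def build_contexts(doc_queries, docs, max_tokens, keyword_fn=longest_token):
    """Group documents by the keyword of their predicted queries, then
    concatenate each group into contexts of at most max_tokens tokens.

    doc_queries: dict doc_id -> list of predicted query strings (assumed given)
    docs:        dict doc_id -> document text
    Returns a list of contexts, each a list of doc_ids.
    """
    # Inverted index: keyword -> documents whose queries share that keyword.
    index = defaultdict(list)
    for doc_id, queries in doc_queries.items():
        for query in queries:
            keyword = keyword_fn(query)
            if doc_id not in index[keyword]:
                index[keyword].append(doc_id)

    contexts = []
    for doc_ids in index.values():
        current, length = [], 0
        for doc_id in doc_ids:
            # Whitespace split is a crude stand-in for a real tokenizer.
            n_tokens = len(docs[doc_id].split())
            if current and length + n_tokens > max_tokens:
                contexts.append(current)
                current, length = [], 0
            current.append(doc_id)
            length += n_tokens
        if current:
            contexts.append(current)
    return contexts


docs = {
    "a": "solar panels convert sunlight",
    "b": "wind turbines and solar farms",
    "c": "baking sourdough bread at home",
}
doc_queries = {
    "a": ["what is photovoltaics"],
    "b": ["cost of photovoltaics"],
    "c": ["how to bake sourdough"],
}
contexts = build_contexts(doc_queries, docs, max_tokens=100)
```

In this toy run, documents "a" and "b" share a keyword and land in the same context, while "c" forms its own. A document whose queries map to several keywords appears in several groups in this sketch; how such overlaps are handled in practice is a design choice this example does not settle.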

Paper Structure (35 sections, 1 equation, 11 figures, 17 tables, 1 algorithm)

This paper contains 35 sections, 1 equation, 11 figures, 17 tables, 1 algorithm.

Figures (11)

  • Figure 1: The Needle-in-a-Haystack task evaluates a model's ability to retrieve specific information (the needle) from a large collection of documents (the haystack). Following LongVA [zhang2024longva] and LWM [liu2024world], the x-axis represents the document length, ranging from 25K to 1M tokens, and the y-axis indicates the position of the "needle" within the document. To the best of our knowledge, Quest is the first base model (without instruction tuning) to achieve 100% accuracy with a 1M context length.
  • Figure 2: The cosine similarity of aggregated documents and the corresponding performance. The dotted lines indicate the performance of the models, with all results normalized to align within the specified similarity range. High similarity means the semantic correlation is strong, and low similarity indicates good context diversity. Quest balances the semantic correlation and context diversity, resulting in the best performance.
  • Figure 3: Overview of the Query-centric data synthesis (Quest) method. Unlike the Standard pre-training strategy, which randomly shuffles documents in the input context, Quest places relevant documents in the same context.
  • Figure 4: Performance comparison on the Needle-in-a-Haystack task for a collection of base models (without instruction tuning). Quest-LLaMA-3-8B-128k exhibits strong performance, significantly outperforming other open-source models of similar or larger sizes. Unlike Figure 1, task difficulty is increased within the 128k context length by retrieving a sentence rather than a random numeric string. "Acc" denotes the percentage of model responses rated as fully accurate (scoring 10 in GPT-4's evaluation) out of all responses generated.
  • Figure 5: t-SNE visualization of aggregated documents from different methods. The proposed Quest maintains a balanced distribution across varying context lengths. See the appendix for more examples.
  • ...and 6 more figures
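
The Needle-in-a-Haystack setup described in the captions of Figures 1 and 4 (a needle placed at a given position inside a document of a given length) can be sketched as follows. This is an illustrative construction under stated assumptions, not the evaluation code used in the paper: `insert_needle` and `make_grid` are hypothetical names, tokens are plain list items, and depth is expressed as a fraction from 0.0 (start) to 1.0 (end).

```python
def insert_needle(haystack_tokens, needle_tokens, context_len, depth):
    """Build one test context of exactly context_len tokens with the needle
    inserted at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    budget = context_len - len(needle_tokens)  # tokens left for filler text
    if budget < 0 or budget > len(haystack_tokens):
        raise ValueError("context_len is incompatible with the inputs")
    filler = haystack_tokens[:budget]
    pos = int(depth * budget)  # insertion index inside the filler
    return filler[:pos] + needle_tokens + filler[pos:]


def make_grid(haystack_tokens, needle_tokens, lengths, depths):
    """One context per (length, depth) cell, mirroring the two plot axes."""
    return {
        (n, d): insert_needle(haystack_tokens, needle_tokens, n, d)
        for n in lengths
        for d in depths
    }


haystack = ["w"] * 100
needle = ["N1", "N2"]
grid = make_grid(haystack, needle, lengths=[10, 20], depths=[0.0, 0.5, 1.0])
```

Each cell of the grid would then be passed to the model with a question about the needle, and the answer scored; the scoring procedure itself is outside the scope of this sketch.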