DataSculpt: Crafting Data Landscapes for Long-Context LLMs through Multi-Objective Partitioning

Keer Lu, Xiaonan Nie, Zheng Liang, Da Pan, Shusen Zhang, Keshi Zhao, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang

TL;DR

DataSculpt frames the organization of training data for long-context LLMs as a multi-objective combinatorial optimization problem spanning multiple domains. It uses a coarse-to-fine pipeline: Phase 1 clusters documents semantically with a FAISS-augmented ISODATA variant, and Phase 2 applies a greedy, multi-objective allocation that maximizes relevance and document integrity while minimizing cross-document truncation. On a 7B decoder-only model pre-trained on 15B tokens with 16K–64K contexts, DataSculpt yields substantial gains in retrieval augmentation, summarization, reading comprehension, and code completion, while preserving or modestly improving general understanding. These gains over the baselines suggest that principled data construction is an important lever for extending effective context length in LLMs.

Abstract

In recent years, Large Language Models (LLMs) have demonstrated significant improvements across a variety of tasks, one of which is the long-context capability. The key to improving long-context performance lies in effective data organization and management strategies that integrate data from multiple domains and optimize the context window during training. Through extensive experimental analysis, we identified three key challenges in designing effective data management strategies that enable the model to achieve long-context capability without sacrificing performance in other tasks: (1) a shortage of long documents across multiple domains, (2) effective construction of context windows, and (3) efficient organization of large-scale datasets. To address these challenges, we introduce DataSculpt, a novel data management framework designed for long-context training. We first formulate the organization of training data as a multi-objective combinatorial optimization problem, focusing on attributes including relevance, homogeneity, integrity, and efficiency. Specifically, our approach utilizes a coarse-to-fine methodology to optimize training data organization both efficiently and effectively. We begin by clustering the data based on semantic similarity (coarse), followed by a multi-objective greedy search within each cluster to score and concatenate documents into various context windows (fine). Our comprehensive evaluations demonstrate that DataSculpt significantly enhances long-context training performance, resulting in improvements of 18.09% in retrieval augmentation, 21.23% in summarization, 21.27% in reading comprehension, and a 3.81% increase in code completion, while also maintaining overall model proficiency with a 4.88% improvement.
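The fine stage described above (scoring and concatenating documents into context windows within one cluster) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `pack_cluster`, the weights `alpha` and `beta`, and the scoring rule (cosine similarity to the window's summed embedding plus the resulting fill ratio) are assumptions chosen for illustration. It also assumes the documents already belong to one semantic cluster and have been chunked so that none exceeds the context length.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pack_cluster(docs, capacity, alpha=1.0, beta=0.5):
    """Greedily pack one cluster's documents into context windows.

    docs: list of (doc_id, n_tokens, embedding) tuples.
    capacity: context length in tokens.
    alpha, beta: hypothetical weights for relevance and window utilization.
    """
    windows = []  # each window: {"ids": [...], "used": int, "sum": [...]}
    # Largest documents first, so long documents are placed before short fillers.
    for doc_id, n_tokens, emb in sorted(docs, key=lambda d: -d[1]):
        if n_tokens > capacity:
            raise ValueError(f"{doc_id} exceeds the context length; chunk it first")
        best, best_score = None, -math.inf
        for w in windows:
            if w["used"] + n_tokens > capacity:
                continue  # would force truncation, so this window is not a candidate
            relevance = cosine(emb, w["sum"])  # same as cosine to the centroid
            fill = (w["used"] + n_tokens) / capacity
            score = alpha * relevance + beta * fill
            if score > best_score:
                best, best_score = w, score
        if best is None:
            # No existing window has room: open a new one.
            windows.append({"ids": [doc_id], "used": n_tokens, "sum": list(emb)})
        else:
            best["ids"].append(doc_id)
            best["used"] += n_tokens
            best["sum"] = [s + e for s, e in zip(best["sum"], emb)]
    return windows
```

Because a document is only ever placed in a window with enough remaining room, no document is split across windows in this sketch; the trade-off between relevance and utilization is controlled entirely by the two assumed weights.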

Paper Structure (28 sections, 7 equations, 6 figures, 9 tables, 2 algorithms)

Figures (6)

  • Figure 1: An illustration of the data processing workflow, training task, context window and self-attention mechanism.
  • Figure 2: An illustration of the shortage of long documents across multiple domains. Left: The length distribution of individual documents across various domains, arranged in descending order by the proportion of documents exceeding 64K tokens. Documents such as books and academic papers predominantly consist of medium to long lengths (L $\ge$ 16K), whereas other domains, particularly web-sourced data, contain documents almost exclusively shorter than 4K tokens. Right: The proportion of each domain in the overall training dataset.
  • Figure 3: Illustration of DataSculpt. In the data preprocessing stage, documents are first divided into chunks no longer than the context length $L$, which are then transformed into vector embeddings. During the semantic clustering phase, we implement a variant of the ISODATA algorithm, augmented with the FAISS vector search library, to aggregate documents based on semantic similarity. Following this, a greedy, semantic-driven largest-fit algorithm arranges the documents in each cluster into sequences configured for long-context training under the multiple objectives.
  • Figure 4: Visualization using t-SNE on document embeddings sampled from the web-derived (English) training corpus, with each cluster denoted by a distinct color. Left: the semantic clustering results generated by the ISODATA-based clustering algorithm. Right: clusters whose nodes are united by a shared traversal path, as identified by the greedy graph traversal algorithm discussed in ICLM (Shi et al., 2023).
  • Figure 5: “Needle in a haystack” performance comparison, where the x-axis represents the document’s length (the “haystack”), and the y-axis reflects the position of the “needle” (a short sentence) within the document, ranging from 1K to 64K tokens. The model’s performance in reciting information from the “needle” across various document lengths and positions is assessed on a scale from 1 to 10, represented through a color gradient transitioning from red (score of 1) to green (score of 10).
  • ...and 1 more figures