DataMan: Data Manager for Pre-training Large Language Models

Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao

TL;DR

DataMan is a data management framework that annotates pre-training text with 14 quality criteria and 15 domain types; the criteria are derived by "reverse thinking" about the causes of perplexity anomalies. Applying DataMan to SlimPajama's 447B tokens produces the DataPajama dataset. Training a 1.3B-parameter model on 30B selected tokens demonstrates the benefits of data selection and domain mixing, with improvements in ICL, perplexity, and instruction following over a state-of-the-art baseline. Selecting on the Overall Score at l=5 yields the strongest gains, and continued pre-training on high-quality domain-specific data boosts domain-specific ICL. The work highlights the partial misalignment between perplexity and ICL, underscores the complementarity of the quality criteria, and provides data-sharing resources to the community.

Abstract

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we draw on "reverse thinking": prompting LLMs to self-identify which criteria benefit their performance. As an LLM's pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B-token pre-training corpus with 14 quality ratings and a domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5, surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, and we analyze the misalignment between PPL and ICL performance. We also thoroughly analyze our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.
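
To make the idea of rating-based selection under a token budget concrete, here is a minimal sketch. It is an illustration only, not the paper's sampling strategy: the `Doc` record, its `overall_score` and `n_tokens` fields, and the greedy highest-score-first rule are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical record; field names are assumptions for illustration.
    doc_id: str
    overall_score: int  # assumed rating from 1 (worst) to 5 (best)
    n_tokens: int


def select_by_score(docs, token_budget):
    """Greedily pick the highest-rated documents that fit in the budget."""
    selected, used = [], 0
    # sorted() is stable, so documents with equal scores keep input order.
    for doc in sorted(docs, key=lambda d: -d.overall_score):
        if used + doc.n_tokens > token_budget:
            continue  # skip documents that would exceed the budget
        selected.append(doc)
        used += doc.n_tokens
    return selected, used


docs = [
    Doc("a", 5, 400),
    Doc("b", 3, 300),
    Doc("c", 5, 500),
    Doc("d", 4, 200),
]
selected, used = select_by_score(docs, token_budget=1000)
print([d.doc_id for d in selected], used)
```

A real pipeline would also balance the selection across domain types; that step is omitted here to keep the sketch short.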

Paper Structure

This paper contains 56 sections, 3 equations, 6 figures, 38 tables.

Figures (6)

  • Figure 1: The pipeline of the Sample-with-DataMan model. We derive 14 quality criteria from LLMs' reverse thinking and use DataMan to annotate the quality rating and domain type of pre-training data. LLMs trained on subsets selected with data sampling strategies outperform the state-of-the-art data sampling baseline.
  • Figure 2: Instruction-following win rates of Sample-with-DataMan models vs. the state-of-the-art baseline (Educational value $\tau = 2.0$) after instruction fine-tuning on 10K ShareGPT examples. Under the same SFT conditions, our models consistently surpass the SOTA baseline, with Overall Score l=5 reaching an impressive win rate of 78.5%.
  • Figure 3: The distribution of quality ratings across different sources in DataPajama
  • Figure 4: Correlations between quality ratings and negative log-likelihood scores computed by Llama-2-7B (Touvron et al., 2023) over the 30B-token training documents. The negative log-likelihoods are averaged over the number of tokens and equal the logarithm of the perplexity of a single sequence. We observe that perplexity scores are not good approximations for any quality criterion.
  • Figure 6: Pearson correlation heatmap between 14 quality criteria in the fine-tuning dataset.
  • ...and 1 more figure
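
The Figure 4 caption relies on the relationship between token-averaged negative log-likelihood and perplexity, and Figure 6 relies on Pearson correlation. The sketch below shows both in plain Python. The function names and the toy inputs are assumptions for illustration; the per-token log-probabilities would in practice come from a language model, which is not called here.

```python
import math


def mean_nll(token_log_probs):
    """Negative log-likelihood averaged over the tokens of one sequence."""
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity(token_log_probs):
    """Perplexity is the exponential of the token-averaged NLL."""
    return math.exp(mean_nll(token_log_probs))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy sequence: every token assigned probability 0.5 (natural-log probabilities).
log_probs = [math.log(0.5)] * 4
print(mean_nll(log_probs), perplexity(log_probs))

# Toy check: perfectly linearly related lists have correlation close to 1.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Correlating per-document `mean_nll` values with a quality rating using `pearson` would reproduce the kind of comparison the caption describes, given real model scores and ratings.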