DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models

Ranchi Zhao, Zhen Leng Thai, Yifan Zhang, Shengding Hu, Yunqi Ba, Jie Zhou, Jie Cai, Zhiyuan Liu, Maosong Sun

TL;DR

DecorateLM tackles the data-quality problem in large language model pretraining with a three-phase decoration framework: rating, tagging, and editing. A teacher–student distillation pipeline produces a compact DecorateLM that annotates and refines a massive corpus, yielding a Decorated Corpus of 45B high-quality tokens selected from 100B tokens for LM training. Empirical results across diverse benchmarks show that high-quality, well-structured data improves performance and domain coverage, and that the integrated "Rat. Agg. & Tag. & Edit." strategy, which combines rating, tagging, and editing, delivers the strongest gains. The approach offers a scalable way to improve pretraining data quality without prohibitive compute, highlighting data-centric methods as a practical lever for LM capability and generalization.

Abstract

The performance of Large Language Models (LLMs) is substantially influenced by the pretraining corpus, which consists of vast quantities of unsupervised data processed by the models. Despite its critical role in model performance, ensuring the quality of this data is challenging due to its sheer volume and the absence of sample-level quality annotations and enhancements. In this paper, we introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing. Specifically, DecorateLM rates texts against quality criteria, tags texts with hierarchical labels, and edits texts into a more formalized format. Due to the massive size of the pretraining corpus, adopting an LLM for decorating the entire corpus is less efficient. Therefore, to balance performance with efficiency, we curate a meticulously annotated training corpus for DecorateLM using a large language model and distill data engineering expertise into a compact 1.2 billion parameter small language model (SLM). We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for the further training of another 1.2 billion parameter LLM. Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.
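The selection step described in the abstract (keeping 45 billion of 100 billion tokens based on ratings) can be illustrated with a minimal sketch. This is not the paper's sampling procedure: the aggregation by simple averaging, the document fields `ratings` and `n_tokens`, and the function names are all hypothetical, and the weighted sampling shown is the generic Efraimidis-Spirakis scheme swapped in for illustration.

```python
import random


def aggregate_rating(ratings):
    """Average per-criterion scores into one number (assumed aggregation)."""
    return sum(ratings.values()) / len(ratings)


def select_by_rating(docs, token_budget, seed=0):
    """Sample documents without replacement, favoring higher aggregate
    ratings, and keep them while they fit into the token budget.

    Each doc is a dict with hypothetical fields:
      "ratings":  {criterion_name: positive score}
      "n_tokens": int
    """
    rng = random.Random(seed)
    keyed = []
    for doc in docs:
        # Efraimidis-Spirakis key: u ** (1 / weight); larger keys are drawn first.
        weight = max(aggregate_rating(doc["ratings"]), 1e-9)
        keyed.append((rng.random() ** (1.0 / weight), doc))
    keyed.sort(key=lambda pair: pair[0], reverse=True)

    selected, used = [], 0
    for _, doc in keyed:
        if used + doc["n_tokens"] > token_budget:
            continue  # skip documents that would exceed the budget
        selected.append(doc)
        used += doc["n_tokens"]
    return selected, used


if __name__ == "__main__":
    # Toy corpus: 10 documents of 10 tokens each, budget of 45 tokens.
    corpus = [
        {"ratings": {"educational_value": s, "reasoning": s}, "n_tokens": 10}
        for s in range(1, 11)
    ]
    chosen, used_tokens = select_by_rating(corpus, token_budget=45)
    print(len(chosen), used_tokens)
```

Sampling in proportion to the rating, rather than taking a hard top-k cut, keeps some lower-rated documents in the mix; whether that matches the trade-off made in the paper is not stated in this summary.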

Paper Structure (43 sections, 3 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: We utilize GPT-4 to assemble an annotated training corpus and integrate data engineering expertise into DecorateLM. DecorateLM is then used to process 100 billion tokens from the raw corpus, sampling 45 billion tokens using its rating and tagging capabilities to create what we refer to as the Decorated corpus. We further enhance the Decorated corpus by applying DecorateLM's editing features, making it more suitable for LLM training.
  • Figure 2: The Spearman correlations between model ratings and the ground truth on the validation set. The x-axis represents the ground-truth rating scores of the data. The y-axis represents the rating scores predicted by GPT-4 and DecorateLM on the validation set. Rating scores generated by GPT-4 are more discrete and less accurate than those generated by DecorateLM.
  • Figure 3: Spearman correlation coefficients between various rating criteria. The correlations align with intuitive expectations. For instance, data with higher educational value often exhibits a higher reasoning level, which, in turn, enhances its comprehensibility.
  • Figure 4: Word cloud of tags. The size of each tag is proportional to its frequency in the annotated dataset. Tags are color-coded based on their levels: first-level tags in dark blue, second-level tags in medium blue, and third-level tags in light blue.
  • Figure 5: Evaluation of dataset rating and tagging quality using DecorateLM. The x-axis denotes the average rating of each dataset across the specified dimensions, whereas the y-axis represents the cross-entropy of tags from the predefined tagging system. The circle size correlates with the dataset size.
  • ...and 5 more figures
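
The two statistics named in the captions above can be sketched in a few lines. The Spearman coefficient (Figures 2 and 3) is the Pearson correlation of the ranks, with tied values given their average rank. For Figure 5, the "cross-entropy of tags" is read here, as an assumption, as the Shannon entropy of a dataset's tag distribution; the paper's exact definition is not given in this summary. All function names are illustrative.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def tag_entropy(tags):
    """Shannon entropy (in nats) of the empirical tag distribution."""
    counts = Counter(tags)
    total = len(tags)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    ground_truth = [1, 2, 3, 4, 5]
    predicted = [1.2, 1.9, 3.5, 3.4, 4.8]
    print(spearman(ground_truth, predicted))
    print(tag_entropy(["science", "science", "history", "sports"]))
```

Because Spearman depends only on rank order, coarse, discrete scores produce many ties and tend to lower the coefficient, which is consistent with the Figure 2 remark that GPT-4's ratings are more discrete.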