RedStone: Curating General, Code, Math, and QA Data for Large Language Models

Yaoyao Chang, Lei Cui, Li Dong, Shaohan Huang, Yangyu Huang, Yupan Huang, Scarlett Li, Tengchao Lv, Shuming Ma, Qinzheng Sun, Wenhui Wang, Furu Wei, Ying Xin, Mao Yang, Qiufeng Yin, Xingxing Zhang

TL;DR

The paper tackles the data bottleneck in large language model pre-training by harnessing Common Crawl to build broad and domain-specific datasets. It introduces RedStone, a scalable Extraction and Filtering pipeline that yields RedStone-Web, RedStone-Code, RedStone-Math, and RedStone-QA, collectively totaling 3.48 trillion tokens and enabling targeted improvements across general language, code generation, mathematics, and QA tasks. The results show RedStone surpassing several open-source baselines on common-sense benchmarks, code and math reasoning, and QA datasets, demonstrating the value of web-scale, quality-filtered data for domain adaptation. The authors also commit to open-sourcing the pipeline and data to foster reproducibility and broader adoption in developing competitive LLMs.

Abstract

Pre-training Large Language Models (LLMs) on high-quality, meticulously curated datasets is widely recognized as critical for enhancing their performance and generalization capabilities. This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training LLMs, addressing both general-purpose language understanding and specialized domain knowledge. We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl, facilitating the creation of extensive and varied pre-training datasets. Unlike traditional datasets, which often require expensive curation and domain-specific expertise, RedStone leverages the breadth of Common Crawl to deliver datasets tailored to a wide array of domains. In this work, we exemplify its capability by constructing pre-training datasets across multiple fields, including general language understanding, code, mathematics, and question-answering tasks. The flexibility of RedStone allows for easy adaptation to other specialized domains, significantly lowering the barrier to creating valuable domain-specific datasets. Our findings demonstrate that Common Crawl, when harnessed through effective pipelines like RedStone, can serve as a rich, renewable source of pre-training data, unlocking new avenues for domain adaptation and knowledge discovery in LLMs. This work also underscores the importance of innovative data acquisition strategies and highlights the role of web-scale data as a powerful resource in the continued evolution of LLMs. RedStone code and data samples will be publicly available at https://aka.ms/redstone.

Paper Structure

This paper contains 40 sections, 2 figures, and 12 tables.

Figures (2)

  • Figure 1: Using RedStone, we created two types of data: general domain data and domain-specific data. General domain data comprises RedStone-Web, which does not specify a data domain, allowing the model to learn common knowledge across various domains. Domain-specific data includes RedStone-Code, RedStone-Math, and RedStone-QA, enabling the model to acquire specialized knowledge in particular areas or formats. Each example type features the original webpage screenshot on the left and the corresponding data processed by RedStone on the right.
  • Figure 2: Subsequent stages of RedStone-Web. RedStone processes Common Crawl data in separate steps, handling WARC and WET files independently before merging them to increase the token count. Over 99% of the tokens in Common Crawl are removed during processing. Since WARC files are in HTML format and inconvenient for token counting, and WARC and WET files represent different forms of the same data, the token count from WET files is used as the original token count for both formats.
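
The token accounting described in the Figure 2 caption can be illustrated with a small sketch. This is not the RedStone implementation: the function names, the URL-keyed merge, the tie-break that prefers WARC-derived text, and the whitespace tokenizer are all assumptions made for illustration. It only shows the idea of filtering two branches independently, merging them, and measuring removal against the WET token count.

```python
def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer (assumption)."""
    return len(text.split())


def merge_branches(warc_docs: dict[str, str], wet_docs: dict[str, str]) -> dict[str, str]:
    """Union of two independently filtered branches, keyed by URL.

    Keying by URL and preferring the WARC-derived text on overlap are
    illustrative choices, not details taken from the paper.
    """
    merged = dict(wet_docs)
    merged.update(warc_docs)  # WARC-derived text wins when both branches kept a URL
    return merged


def removal_rate(original_wet_docs: dict[str, str], kept_docs: dict[str, str]) -> float:
    """Fraction of tokens removed, using WET tokens as the original count."""
    original = sum(count_tokens(text) for text in original_wet_docs.values())
    if original == 0:
        raise ValueError("original WET token count is zero")
    kept = sum(count_tokens(text) for text in kept_docs.values())
    return 1.0 - kept / original


# Toy data: three raw WET documents, one survivor per branch after filtering.
raw_wet = {
    "a": "one two three four",
    "b": "five six seven eight",
    "c": "nine ten",
}
warc_kept = {"a": "one two"}
wet_kept = {"b": "five"}

merged = merge_branches(warc_kept, wet_kept)
rate = removal_rate(raw_wet, merged)
```

Merging raises the kept token count above what either branch retains alone, which matches the caption's stated reason for combining the WARC and WET outputs.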