
ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

Wanyue Zhang, Ziyong Li, Wen Yang, Chunlin Leng, Yinan Bai, Qianlong Du, Chengqing Zong, Jiajun Zhang

TL;DR

This work introduces MDFG-tool, a tool-chain for generating large-scale Chinese pre-training data with multi-dimensional, fine-grained annotations, and releases ChineseWebText2.0, a 3.8 TB corpus in which each text carries a quality score, domain labels, a toxicity label, and a toxicity score. The pipeline combines handcrafted rule-based filtering with three annotators (quality, domain, toxicity) and uses LLM-in-the-loop strategies to improve toxicity evaluation. It also yields a 3.16 GB toxicity subset, and its annotations are reported to align closely with human judgments. The reported results cover data quality control, domain coverage across 11 areas, and toxicity detection, so researchers can curate data for domain-specific and safety-conscious LLM training. The authors plan to broaden domain coverage and add further fine-grained metadata in future work.

Abstract

During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years, several large-scale, high-quality pre-training datasets have been released to accelerate LLM research, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, the focus has increasingly shifted to domain-specific capabilities and safety concerns, and these earlier coarse-grained corpora no longer meet training requirements. Fine-grained information, such as quality, domain and toxicity, is becoming increasingly important for building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale, high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we apply manually crafted rules to discard explicitly noisy texts from the raw content. Second, a quality evaluation model, a domain classifier, and a toxicity evaluation model each assess the remaining cleaned data. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release ChineseWebText2.0, the largest high-quality, fine-grained Chinese text dataset. It totals 3.8 TB, and each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, so that LLM researchers can select data based on these types of fine-grained information. The data, code and tool-chain are available at https://github.com/CASIA-LM/ChineseWebText-2.0
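As an illustration of how per-text annotations of this kind could drive data selection, the following is a minimal sketch only. The field names (`quality_score`, `domain_labels`, `toxicity_label`, `toxicity_score`), the domain names, and the thresholds are hypothetical assumptions chosen for the example; they are not taken from the released schema, which should be checked in the repository.

```python
from typing import Iterable, Iterator

# NOTE: all field names, domain names, and thresholds below are illustrative
# assumptions, not the actual ChineseWebText2.0 record layout.


def select_records(
    records: Iterable[dict],
    min_quality: float,
    wanted_domains: frozenset,
    max_toxicity: float,
) -> Iterator[dict]:
    """Yield records that pass quality, toxicity, and domain filters."""
    for rec in records:
        # Drop texts whose quality score falls below the threshold.
        if rec["quality_score"] < min_quality:
            continue
        # Drop texts flagged as toxic or scored above the toxicity limit.
        if rec["toxicity_label"] == 1 or rec["toxicity_score"] > max_toxicity:
            continue
        # Keep only texts carrying at least one wanted domain label.
        if not wanted_domains.intersection(rec["domain_labels"]):
            continue
        yield rec


# Tiny in-memory sample standing in for annotated corpus records.
sample = [
    {"text": "a", "quality_score": 0.95, "domain_labels": ["medicine"],
     "toxicity_label": 0, "toxicity_score": 0.01},
    {"text": "b", "quality_score": 0.50, "domain_labels": ["medicine"],
     "toxicity_label": 0, "toxicity_score": 0.01},
    {"text": "c", "quality_score": 0.97, "domain_labels": ["law"],
     "toxicity_label": 0, "toxicity_score": 0.02},
    {"text": "d", "quality_score": 0.99, "domain_labels": ["medicine", "education"],
     "toxicity_label": 1, "toxicity_score": 0.98},
]

selected = [
    rec["text"]
    for rec in select_records(
        sample,
        min_quality=0.9,
        wanted_domains=frozenset({"medicine"}),
        max_toxicity=0.1,
    )
]
print(selected)
```

Written as a generator, the filter can be applied to a stream of records without loading the corpus into memory, which matters at the multi-terabyte scale described here.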

Paper Structure

This paper contains 41 sections, 4 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The pipeline of MDFG-tool.
  • Figure 2: The Overview of BERTEval Training Data Composition and Model Architecture.
  • Figure 3: The Quality Evaluation.
  • Figure 4: The Overview of BERTEval Training Data Composition and Model Architecture.
  • Figure 5: The Architecture of Toxicity Evaluator.
  • ...and 5 more figures