OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei

TL;DR

OpenCSG Chinese Corpus introduces four high-quality Chinese data resources—Fineweb-Edu-Chinese, Fineweb-Edu-Chinese-V2, Cosmopedia-Chinese, and Smoltalk-Chinese—to advance LLM pretraining, post-training, and fine-tuning. By combining education-focused filtering, synthetic textbook-like content, and diverse multi-turn dialogues with automated scoring and deduplication, the work demonstrates dataset-specific gains on benchmarks like CMMLU and CEval and highlights the strongest alignment improvements from Smoltalk-Chinese. The results underscore the value of curated, diverse data for Chinese LLM development while acknowledging limitations such as data homogeneity and markdown formatting, and they propose future directions involving real-world data blending and broader evaluation. Overall, the OpenCSG effort advances scalable, open, and high-quality Chinese corpora to accelerate community-driven improvements in Chinese NLP.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.

Paper Structure (16 sections, 8 figures)

Figures (8)

  • Figure 1: The diagram illustrates the construction pipelines for three Chinese datasets: FineWeb-edu-Chinese, COSMOPEDIA-Chinese, and Smoltalk-Chinese. The FineWeb-edu-Chinese pipeline begins with various Chinese corpora, followed by random sampling, data pooling, annotation, and scoring. A fine-tuned BERT-based model is trained to filter and generate the final dataset. The COSMOPEDIA-Chinese pipeline starts with seed data collection, proceeds through prompt design and data generation, and results in a database of curated knowledge. Lastly, the Smoltalk-Chinese pipeline leverages powerful Chinese LLMs with task-specific system prompts to generate conversational data.
  • Figure 2: Score distribution of all the unfiltered source data scored by the Fineweb-Edu-Chinese-v2 scorer. High-quality samples (score $>3$) form only a small fraction, indicating the scarcity of valuable data in open-source Chinese corpora.
  • Figure 3: (a) Text length distribution of Fineweb-Edu-Chinese; most samples fall in the 0.2k-1k and 1k-2k intervals. (b) Sources of the samples in Fineweb-Edu-Chinese.
  • Figure 4: (a) Text length distribution of Fineweb-Edu-Chinese-v2; most samples fall in the 0.2k-1k interval. (b) Sources of the samples in Fineweb-Edu-Chinese-v2.
  • Figure 5: CMMLU and CEval scores of each checkpoint when training on different datasets.
  • ...and 3 more figures
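
The score-based filtering step mentioned in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. This is not the authors' implementation: `filter_by_score`, `toy_scorer`, and the precomputed `TOY_SCORES` table are hypothetical names introduced here, the scorer is a stand-in for the fine-tuned BERT-based model, and the default threshold of 3 is only an assumption borrowed from the Figure 2 caption's description of high-quality samples (score $>3$).

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical precomputed quality scores; in the described pipeline these
# would come from a fine-tuned BERT-based scorer (not reproduced here).
TOY_SCORES = {
    "doc_a": 4.2,
    "doc_b": 1.5,
    "doc_c": 3.0,
    "doc_d": 3.7,
}


def toy_scorer(text: str) -> float:
    """Stand-in scorer: look up an assumed score, defaulting to 0.0."""
    return TOY_SCORES.get(text, 0.0)


def filter_by_score(
    texts: Iterable[str],
    scorer: Callable[[str], float],
    threshold: float = 3.0,  # assumed cutoff, taken from the Figure 2 caption
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs whose score is strictly above the threshold."""
    kept = []
    for text in texts:
        score = scorer(text)
        if score > threshold:
            kept.append((text, score))
    return kept


samples = ["doc_a", "doc_b", "doc_c", "doc_d"]
kept = filter_by_score(samples, toy_scorer)
kept_names = [name for name, _ in kept]
print(kept_names)
```

With the toy scores above, only the samples scoring strictly above the threshold survive, which mirrors the observation that high-scoring samples make up a small share of the source data. Whether the real pipeline uses a strict or inclusive comparison is not stated in this section, so the cutoff is left as a parameter.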