CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models

Liangdong Wang, Bo-Wen Zhang, Chengwei Wu, Hanyu Zhao, Xiaofeng Shi, Shuhao Gu, Jijie Li, Quanyue Ma, TengFei Pan, Guang Liu

TL;DR

This work presents CCI3.0-HQ, a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. The authors believe this open-access dataset will facilitate broader access to high-quality language models.

Abstract

We present CCI3.0-HQ (https://huggingface.co/datasets/BAAI/CCI3-HQ), a high-quality 500GB subset of the Chinese Corpora Internet 3.0 (CCI3.0) (https://huggingface.co/datasets/BAAI/CCI3-Data), developed using a novel two-stage hybrid filtering pipeline that significantly enhances data quality. To evaluate its effectiveness, we trained a 0.5B-parameter model from scratch on 100B tokens across various datasets, achieving superior performance on 10 benchmarks in a zero-shot setting compared to CCI3.0, SkyPile, and WanjuanV1. The high-quality filtering process effectively distills the capabilities of the Qwen2-72B-instruct model into a compact 0.5B model, attaining optimal F1 scores for Chinese web data classification. We believe this open-access dataset will facilitate broader access to high-quality language models.
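The abstract reports F1 scores for the quality classifier. As a reminder of what that metric measures, here is a minimal sketch of a binary F1 computation in Python. The labels (1 = high quality, 0 = low quality) and the sample predictions are hypothetical and are not taken from the paper, and the averaging scheme the authors actually use is not stated in this section.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (harmonic mean of precision and recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = document judged high quality, 0 = low quality.
reference = [1, 1, 0, 0, 1, 0]   # e.g. labels from a large annotator model
predicted = [1, 0, 0, 1, 1, 0]   # e.g. labels from a small classifier
score = f1_score(reference, predicted)
print(f"F1 = {score:.3f}")
```

In this toy example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3 as well.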

Paper Structure

This paper contains 17 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Dataset Curation Pipeline
  • Figure 2: Effects of Backbone Freezing and Learning Rate Adjustments on Classifier Tuning Performance
  • Figure 3: Mixed Dataset Experiment
  • Figure 4: Chinese Dataset Experiment