DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection

Yingli Shen, Wen Lai, Shuo Wang, Xueren Zhang, Kangyang Luo, Alexander Fraser, Maosong Sun

TL;DR

DCAD-2000 is a large-scale multilingual corpus spanning 2,282 languages, with 46.72TB of text and 8.63B documents, built from Common Crawl data and existing sources. It reframes data cleaning as anomaly detection, using eight interpretable features and an Isolation Forest to filter noisy content dynamically, without manually set thresholds. Experiments training decoder-only LLMs on DCAD-2000 show improved data quality and downstream multilingual performance across multiple benchmarks, especially for low-resource languages. The authors release the dataset and tooling publicly, supporting reproducible multilingual pretraining and a scalable approach to high-quality multilingual data curation.

Abstract

The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and well-curated multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from newly extracted Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of existing data cleaning approaches, which rely on manually designed heuristic thresholds, we reframe data cleaning as an anomaly detection problem. This dynamic filtering paradigm substantially improves data quality by automatically identifying and removing noisy or anomalous content. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.
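
To make the "data cleaning as anomaly detection" idea concrete, the following is a minimal sketch of Isolation Forest scoring over per-document features, written with the Python standard library only. It is not the authors' released tooling. The three features used here (word count, special-character ratio, repeated-word ratio) are illustrative stand-ins for the paper's eight features, which this page does not list, and every function name (`extract_features`, `build_tree`, `score_documents`) is hypothetical.

```python
import math
import random


def extract_features(text):
    """Three illustrative features (stand-ins, not the paper's eight)."""
    words = text.split()
    n_chars = max(len(text), 1)
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    repeated = 1.0 - len(set(words)) / len(words) if words else 0.0
    return [float(len(words)), special / n_chars, repeated]


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively partition points with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    # Only split on features that still vary within this node.
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) < max(p[d] for p in points)]
    if not dims:
        return ("leaf", len(points))
    dim = rng.choice(dims)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus an adjustment for unsplit leaf points."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if x[dim] < split else right, x, depth + 1)


def score_documents(texts, n_trees=100, sample_size=256, seed=0):
    """Return one anomaly score per text; values nearer 1 are more anomalous."""
    rng = random.Random(seed)
    points = [extract_features(t) for t in texts]
    psi = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(psi)) if psi > 1 else 0
    trees = [build_tree(rng.sample(points, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(psi) or 1.0  # guard against a one-document corpus
    scores = []
    for p in points:
        mean_depth = sum(path_length(t, p) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores
```

Documents that random splits isolate after only a few steps receive scores closer to 1, so ranking by score surfaces likely noise without a hand-tuned cutoff for each individual feature. In practice a library implementation such as scikit-learn's `IsolationForest` would likely replace the hand-written trees; the pure-Python version is used here only to keep the sketch self-contained.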

Paper Structure

This paper contains 33 sections, 7 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Scatter plots of eight features extracted from a Chinese corpus during the data cleaning process, with data points color-coded according to their anomaly labels. The yellow points represent high-quality data, while the purple points indicate low-quality data.
  • Figure 2: Document distribution and linguistic diversity in DCAD-2000.
  • Figure 3: The performance comparison of models trained using various data cleaning methods.
  • Figure 4: Comparison of DCAD-2000 with existing multilingual corpora for three languages—French, Chinese, and Turkish—evaluated using different multilingual LLMs.
  • Figure 5: Distribution of average word counts across different languages, sources, and shards in the New CC dataset.