Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

TL;DR

This work evaluates four major web-crawled monolingual corpora (CC100, MaCoCu, mC4, OSCAR) across 11 European languages to understand how text quality translates into LM performance. It combines intrinsic human judgments of text quality with extrinsic encoder-based LM training and downstream task fine-tuning, revealing that MaCoCu and OSCAR generally offer higher-quality text, yet CC100 delivers the strongest downstream performance. The results challenge the assumption that higher-quality data automatically yields better models, showing that corpus size and other factors can override perceived quality in extrinsic evaluations. The study highlights methodological trade-offs in corpus cleaning, evaluation scope, and the relevance of dataset size, calling for nuanced considerations when selecting training data for multilingual LMs.

Abstract

Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we carry out an intrinsic evaluation in which human annotators judge the quality of samples taken from the different corpora; then, we assess the practical impact of the qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in the quality of the corpora, with MaCoCu and OSCAR obtaining the best results. However, in the extrinsic evaluation, we find that the CC100 corpus achieves the highest scores. We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.

Paper Structure (37 sections, 2 figures, 11 tables)

Figures (2)

  • Figure 1: Percentage of annotated documents that are of certain length for each corpus, averaged over the seven languages that had data available in each corpus. For each bar we indicate the percentage of documents that did not fully contain running text, i.e., were annotated as Wrong Language, Not-running Text or Partially Running Text.
  • Figure 2: Average position across the four evaluation tasks plotted over the data set size (GB), for each language-corpus combination. Dotted line is the linear regression line. Note the log scale of the X-axis.
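The kind of analysis described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "average position" means the mean rank of a corpus across tasks (rank 1 = best score) and that the regression line is an ordinary least-squares fit on log10 of the data set size. The function names, the corpus labels "A", "B", "C", and all numbers are hypothetical placeholders, not results from the paper.

```python
import math


def average_position(scores_by_task):
    """Mean rank of each corpus across tasks (rank 1 = highest score).

    scores_by_task: {task_name: {corpus_name: score}}, higher is better.
    Assumption: ties are broken by sort order; the paper's exact
    ranking procedure is not specified in this section.
    """
    positions = {}
    for task_scores in scores_by_task.values():
        ordered = sorted(task_scores, key=task_scores.get, reverse=True)
        for position, corpus in enumerate(ordered, start=1):
            positions.setdefault(corpus, []).append(position)
    return {corpus: sum(p) / len(p) for corpus, p in positions.items()}


def fit_line_on_log_size(sizes_gb, avg_positions):
    """Least-squares fit of avg_position = slope * log10(size_gb) + intercept."""
    xs = [math.log10(size) for size in sizes_gb]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(avg_positions) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, avg_positions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy scores for two tasks and three corpora (not from the paper).
toy_scores = {
    "task1": {"A": 0.9, "B": 0.8, "C": 0.7},
    "task2": {"A": 0.6, "B": 0.7, "C": 0.5},
}
toy_positions = average_position(toy_scores)

# Hypothetical sizes in GB; a negative slope means larger corpora rank better.
slope, intercept = fit_line_on_log_size([1, 10, 100], [3.0, 2.0, 1.0])
```

Fitting on log10 of the size matches the log-scaled X-axis mentioned in the caption: a straight line in the plot corresponds to a linear relation between average position and the logarithm of corpus size.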