Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora

Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, Charitha Rathnayake

TL;DR

This work assesses web-mined parallel corpora at a fine-grained level for three low-resource language pairs (English-Sinhala, English-Tamil, Sinhala-Tamil). Sentence pairs are ranked by LASER-3 similarity, and the top, bottom, and random 25k portions are isolated. Intrinsic human evaluation and extrinsic NMT experiments across multiple corpora (CCMatrix, CCAligned, WikiMatrix, NLLB) show substantial quality variance within corpora and across language pairs; top-25k portions often yield translation quality on par with, or superior to, full corpora, and sometimes rival human-curated data. The study also shows that corpus cleaning can improve results but may not justify the associated human effort, and that the choice of embedding used for ranking (LASER-3 vs LaBSE) influences outcomes. The findings emphasize careful corpus selection and part-wise evaluation for low-resource NMT, and suggest that the methodology applies to other language pairs facing data scarcity. The work contributes a refined error taxonomy, an evaluation framework, and practical guidance for prioritizing high-quality web-mined data in NMT research.

Abstract

We conducted a detailed analysis of the quality of web-mined corpora for two low-resource languages (forming three language pairs: English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out intrinsic and extrinsic evaluations on different portions of the ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.
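
The ranking-and-splitting step described above can be illustrated with a minimal sketch. It assumes sentence embeddings have already been computed by some multilingual encoder (the summary names LASER-3 and LaBSE) and that the similarity measure is cosine similarity, which is an assumption here. The function name `split_ranked_corpus` and its parameters are hypothetical and are not taken from the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_ranked_corpus(pairs, src_vecs, tgt_vecs, k=25_000, seed=0):
    """Rank sentence pairs by embedding similarity and return top/bottom/random k.

    Hypothetical helper: `pairs` is a list of (source, target) sentences, and
    `src_vecs` / `tgt_vecs` are their precomputed embeddings in the same order.
    Drawing the random portion from the whole corpus is an assumption.
    """
    scores = [cosine(u, v) for u, v in zip(src_vecs, tgt_vecs)]
    # Sort by score only, highest first, so ties keep their original order.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    ranked = [pairs[i] for i in order]
    k = min(k, len(ranked))
    rng = random.Random(seed)
    return {
        "top": ranked[:k],
        "bottom": ranked[len(ranked) - k:],
        "random": rng.sample(ranked, k),
    }
```

With a real corpus, each of the three portions would then be evaluated separately, for example by training one NMT model per portion and comparing the scores.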

Paper Structure (33 sections, 1 equation, 8 figures, 20 tables)

Figures (8)

  • Figure 1: Vanilla transformer performance when trained on Top, Bottom and Random 25K splits of NLLB, CCMatrix, CCAligned and WikiMatrix for En-Si (higher is better).
  • Figure 2: NMT results of different models trained on CCMatrix En-Si top, bottom and average 25K splits.
  • Figure 3: NMT results of vanilla transformer model trained on CCMatrix En-Si in jumps of 100K.
  • Figure 4: Vanilla transformer results for En-Si original NLLB Top 25K, NLLB cleaned Top 25K, NLLB cleaned full (27K+), SITA Top 25K, and SITA Random 25K.
  • Figure 5: Vanilla transformer results for En-Ta original NLLB Top 25K, NLLB cleaned Top 25K, NLLB cleaned full (26K+), SITA Top 25K, and SITA Random 25K.
  • ...and 3 more figures