Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
Surangika Ranathunga, Nisansa de Silva, Menan Velayuthan, Aloka Fernando, Charitha Rathnayake
TL;DR
This work conducts a fine-grained assessment of web-mined parallel corpora for three low-resource language pairs (English-Sinhala, English-Tamil, Sinhala-Tamil) by ranking sentence pairs with LASER-3 similarity and isolating the top, bottom, and random 25k portions. Through intrinsic human evaluation and extrinsic NMT experiments across multiple corpora (CCMatrix, CCAligned, WikiMatrix, NLLB), the study shows substantial quality variance within corpora and across language pairs, with top-25k portions often yielding translation quality on par with, or superior to, full corpora and sometimes rivaling human-curated data. It also demonstrates that corpus cleaning can improve results but may not justify the associated human effort, and that the choice of embedding model used for ranking (LASER-3 vs. LaBSE) influences outcomes. The findings emphasize careful corpus selection and part-wise evaluation for low-resource NMT, and suggest that the methodology applies to other language pairs facing data scarcity. The work contributes a refined error taxonomy, an evaluation framework that combines intrinsic and extrinsic assessment, and practical guidance for prioritizing high-quality web-mined data in NMT research.
Abstract
We conducted a detailed analysis of the quality of web-mined corpora for two low-resource languages, Sinhala and Tamil (making three language pairs: English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out intrinsic and extrinsic evaluations on different portions of the ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.
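
The ranking-and-partitioning step described above can be illustrated with a minimal sketch. It assumes each sentence pair already carries a precomputed similarity score (for example, cosine similarity between multilingual sentence embeddings); the function name `split_ranked`, the tuple layout, and the choice to draw the random portion from the whole corpus are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Tuple

# (source sentence, target sentence, similarity score); layout is an assumption
Pair = Tuple[str, str, float]


def split_ranked(
    pairs: List[Pair], k: int, seed: int = 0
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Return the top-k, bottom-k, and a random-k portion of a scored corpus."""
    if not 1 <= k <= len(pairs):
        raise ValueError("k must be between 1 and the corpus size")

    # Rank sentence pairs from most to least similar.
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)

    top = ranked[:k]        # highest-scoring pairs
    bottom = ranked[-k:]    # lowest-scoring pairs

    # Assumption: the random portion is sampled uniformly from the full corpus.
    rng = random.Random(seed)
    rand = rng.sample(ranked, k)

    return top, bottom, rand


if __name__ == "__main__":
    # Toy corpus with made-up scores, for illustration only.
    corpus = [
        ("a", "x", 0.90), ("b", "y", 0.10), ("c", "z", 0.50),
        ("d", "w", 0.70), ("e", "v", 0.30), ("f", "u", 0.95),
    ]
    top, bottom, rand = split_ranked(corpus, k=2)
    print([p[2] for p in top])
    print([p[2] for p in bottom])
```

With a real corpus, `k` would be set to 25,000, and each portion would then be evaluated separately, both by human annotators and by training an NMT model on it.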
