A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain

Jorge del Pozo Lérida, Kamil Kojs, János Máté, Mikołaj Antoni Barański, Christian Hardmeier

TL;DR

The study systematically compares data-filtering techniques (LASER, LaBSE, MUSE) for domain-adapted English–Polish biomedical MT using mBART50. By filtering the UFAL Medical Corpus into 20% and 60% subsets, the authors fine-tune models and evaluate on Khresmoi with SacreBLEU, complemented by bilingual human checks. LASER consistently yields the best or comparable performance with reduced data, outpacing LaBSE and often surpassing random baselines, while MUSE is effective but less consistent. The findings support LASER as the go-to filtering method for this language pair and domain, with notable implications for training efficiency and translation fluency in specialized biomedical MT tasks.

Abstract

Large Language Models (LLMs) have become the state of the art in Machine Translation (MT). They are often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality entries and redundant information and therefore pose significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness varies widely across language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created datasets of varying sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric on the Khresmoi dataset, with the quality of the translations also assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.
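
To make the filtering step concrete, the following is a minimal sketch of one common way such a filter can work. It assumes that each sentence pair is scored by the cosine similarity between a source-side and a target-side embedding, and that the highest-scoring fraction of pairs is kept. The abstract does not spell out the scoring or selection rule, so this is an assumption. The function names `cosine` and `filter_top_fraction`, the placeholder pair ids, and the two-dimensional toy vectors are all hypothetical; in practice the vectors would come from an encoder such as LASER, LaBSE, or MUSE.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_top_fraction(pairs, src_vecs, tgt_vecs, fraction):
    """Keep the `fraction` of sentence pairs with the highest similarity score.

    Assumption: pairs are ranked by cosine similarity of their embeddings.
    The kept pairs are returned in their original corpus order.
    """
    scores = [cosine(s, t) for s, t in zip(src_vecs, tgt_vecs)]
    keep = max(1, round(len(pairs) * fraction))
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in sorted(ranked[:keep])]


# Toy corpus: placeholder ids stand in for English-Polish sentence pairs.
pairs = [("en-0", "pl-0"), ("en-1", "pl-1"), ("en-2", "pl-2"),
         ("en-3", "pl-3"), ("en-4", "pl-4")]

# Hypothetical 2-d embeddings; real encoders produce much larger vectors.
src_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
tgt_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.9], [-1.0, 0.0], [1.0, 1.0]]

subset_60 = filter_top_fraction(pairs, src_vecs, tgt_vecs, 0.6)
subset_20 = filter_top_fraction(pairs, src_vecs, tgt_vecs, 0.2)
```

With the 20% and 60% fractions mentioned in the TL;DR, the same ranking yields nested subsets: every pair in the smaller subset also appears in the larger one.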

Paper Structure

This paper contains 11 sections, 3 tables.