gaHealth: An English-Irish Bilingual Corpus of Health Data

Séamus Lankford, Haithem Afli, Órla Ní Loinsigh, Andy Way

TL;DR

gaHealth addresses the scarcity of health-domain parallel data for English–Irish by constructing a dedicated bilingual corpus from official Irish health documents and Covid data. It provides a modular extraction/alignment/cleaning toolchain and trains Transformer-based models on the in-domain data, reporting substantial translation gains over LoResMT2021 baselines (up to 22.2 BLEU points, ~40%). The paper also offers linguistic guidelines for corpus construction and demonstrates strong GA↔EN improvements (GA-EN BLEU 57.6). The dataset is released online to support future research in low-resource Irish NLP.

Abstract

Machine Translation is a mature technology for many high-resource language pairs. However, in the context of low-resource languages, there is a paucity of parallel datasets available for developing translation models. Furthermore, the development of datasets for low-resource languages often focuses on simply creating the largest possible dataset for generic translation. The benefits and development of smaller in-domain datasets can easily be overlooked. To assess the merits of using in-domain data, a dataset for the specific domain of health was developed for the low-resource English-to-Irish language pair. Our study outlines the process used in developing the corpus and empirically demonstrates the benefits of using an in-domain dataset for the health domain. In the context of translating health-related data, models developed using the gaHealth corpus demonstrated a maximum BLEU score improvement of 22.2 points (40%) when compared with top-performing models from the LoResMT2021 Shared Task. Furthermore, we define linguistic guidelines for developing gaHealth, the first bilingual corpus of health data for the Irish language, which we hope will be of use to other creators of low-resource datasets. gaHealth is now freely available online and is ready to be explored for further research.

Paper Structure (25 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Corpus development process. In developing the corpus, the key steps of data collection, pre-processing, alignment and validation were followed. The role of the toolchain at various stages is highlighted.
  • Figure 2: gaHealth en2ga* system: training the EN-GA model on the combined 16k gaHealth corpus and 8k LoResMT2021 covid corpus, reaching a maximum validation accuracy of 38.5% and a perplexity of 111 after 40k steps. BLEU score: 37.6.
  • Figure 3: adapt covid_extended system: training the EN-GA model on the 8k LoResMT2021 covid corpus, reaching a maximum validation accuracy of 30.0% and a perplexity of 354 after 30k steps. BLEU score: 36.0.
  • Figure 4: gaHealth ga2en system: training the GA-EN model on the combined 16k gaHealth corpus and 8k LoResMT2021 covid corpus, reaching a maximum validation accuracy of 39.5% and a perplexity of 116 after 40k steps. BLEU score: 57.6.