Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation

Felermino D. M. Antonio Ali, Henrique Lopes Cardoso, Rui Sousa-Silva

TL;DR

The paper tackles the scarcity of MT evaluation data for low-resource languages by expanding FLORES+ to Emakhuwa (vmw). It translates the dev and devtest splits from Portuguese to Emakhuwa, employing a peer-review workflow with data preparation, translation, revision, and a Direct Assessment validation pipeline, including control items and orthography judgments. The study benchmarks baseline neural MT against multiple multilingual models (e.g., mT5, ByT5, M2M-100, NLLB-200, AfriByT5/AfriMT5), showing that ByT5-family models yield the strongest gains, especially when using multiple reference translations. The results reveal persistent orthography-related challenges in Emakhuwa and underscore the value of reference diversity for improving translation quality in low-resource languages; the dataset is publicly available for future research and benchmarking.

Abstract

As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The data is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES.

Paper Structure (25 sections, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Workflow
  • Figure 2: Averaged Translation Quality Score Histogram on both dev and devtest sets. Translations with an average score below 70 (indicated by the red line) were returned to the translator for rework.
  • Figure 3: Direct assessment adequacy scores per annotator on the dev set
  • Figure 4: Direct assessment adequacy scores per annotator on the devtest set
  • Figure 5: Direct assessment adequacy scores per annotator on the control set
  • ...and 5 more figures
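
The Figure 2 caption states that translations with an average score below 70 were returned to the translator for rework. The following is a minimal sketch of that rule, not the authors' implementation. It assumes scores on a 0 to 100 Direct Assessment scale, averaged per segment over whatever ratings are available. The function name, the segment IDs, and the example scores are all hypothetical.

```python
from statistics import mean

# Threshold taken from the Figure 2 caption; the 0-100 scale is an assumption.
REWORK_THRESHOLD = 70


def flag_for_rework(scores_by_segment, threshold=REWORK_THRESHOLD):
    """Return IDs of segments whose mean score is strictly below the threshold.

    scores_by_segment maps a segment ID to a list of numeric scores.
    Segments with no scores are skipped rather than flagged.
    """
    return [
        seg_id
        for seg_id, scores in scores_by_segment.items()
        if scores and mean(scores) < threshold
    ]


# Hypothetical ratings, for illustration only (not data from the paper).
example = {
    "seg-1": [85, 90, 78],
    "seg-2": [60, 72, 65],
    "seg-3": [70, 70],
}
print(flag_for_rework(example))
```

In this sketch a segment averaging exactly 70 is not flagged, which follows the caption's wording ("below 70"). Whether the authors treated the boundary the same way is not stated.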