AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal, Marek Masiak, Ricardo Rei, Eleftheria Briakou, Marine Carpuat, Xuanli He, Sofia Bourhim, Andiswa Bukula, Muhidin Mohamed, Temitayo Olatoye, Tosin Adewumi, Hamam Mokayed, Christine Mwase, Wangui Kimotho, Foutse Yuehgoh, Anuoluwapo Aremu, Jessica Ojo, Shamsuddeen Hassan Muhammad, Salomey Osei, Abdul-Hakeem Omotayo, Chiamaka Chukwuneke, Perez Ogayo, Oumaima Hourrane, Salma El Anigri, Lolwethu Ndolela, Thabiso Mangwana, Shafie Abdi Mohamed, Ayinde Hassan, Oluwabusayo Olufunke Awoyomi, Lama Alkhaled, Sana Al-Azzawi, Naome A. Etori, Millicent Ochieng, Clemencia Siro, Samuel Njoroge, Eric Muchiri, Wangari Kimotho, Lyse Naomi Wamba Momo, Daud Abolade, Simbiat Ajao, Iyanuoluwa Shode, Ricky Macharm, Ruqayya Nasir Iro, Saheed S. Abdullahi, Stephen E. Moore, Bernard Opoku, Zainab Akinjobi, Abeeb Afolabi, Nnaemeka Obiefuna, Onyekachi Raphael Ogbu, Sam Brian, Verrah Akinyi Otiende, Chinedu Emmanuel Mbonu, Sakayo Toadoum Sari, Yao Lu, Pontus Stenetorp

TL;DR

The paper tackles the challenge of evaluating machine translation for under-resourced African languages by moving beyond BLEU toward metrics that align with human judgments. It introduces AfriMTE, a high-quality human evaluation dataset built with simplified MQM guidelines and Direct Assessment (DA) scoring for 13 African languages, and AfriCOMET, a COMET-based MT evaluation metric for African languages that uses transfer learning from well-resourced DA data and the AfroXLM-R-L encoder. It also presents AfriCOMET-QE, a reference-free quality estimation framework, and benchmarks it against strong baselines, including GPT-4 prompting. Across extensive experiments, AfroXLM-R-L-based models achieve state-of-the-art Spearman correlations with human judgments (up to 0.441) and generalize robustly. The work also emphasizes data quality, domain variety, and open release to spur further research. The approach offers a practical path to reliable MT evaluation and QE for African languages, with implications for multilingual NLP evaluation in other low-resource contexts.

Abstract

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
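The abstract ranks metrics by their Spearman-rank correlation with human judgments. As a rough illustration of what that number measures, the following is a minimal sketch in plain Python. The function names and the five example scores are hypothetical and are not taken from the paper or its data; the paper's own evaluation tooling is not shown here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: human DA scores and metric scores for five translations.
human_da = [85, 40, 62, 95, 20]
metric_scores = [0.71, 0.35, 0.66, 0.80, 0.42]

rho = spearman(metric_scores, human_da)
print(f"Spearman correlation: {rho:.3f}")
```

Because only the rank order matters, a metric can score well on this measure even when its raw values sit on a different scale from the human DA scores.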

Paper Structure

This paper contains 27 sections, 8 figures, and 21 tables.

Figures (8)

  • Figure 1: Screenshot of the user interface showing an adequacy annotation task, comprising the source sentence and its corresponding translation for English-Yoruba.
  • Figure 2: Translation quality of all qualified annotated translations as measured by raw DA scores across all language pairs and domains in ascending order, with medians displayed in the plot for adequacy (upper) and fluency (lower).
  • Figure 3: Adequacy annotation guideline for error highlighting [the first part] and DA score assignment [the second part].
  • Figure 4: Fluency annotation guideline for error highlighting [the first part] and DA score assignment [the second part].
  • Figure 5: Counts of each error category and sentence-level translation quality measured by DA scores across all language pairs and domains for adequacy.
  • ...and 3 more figures