SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?
Senyu Li, Jiayi Wang, Felermino D. M. A. Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani
TL;DR
This paper tackles the challenge of evaluating machine translation (MT) quality for under-resourced African languages by introducing SSA-MTE, a large human-annotated MT evaluation (MTE) dataset covering 14 language pairs with over 73,000 sentence-level annotations. Building on this dataset, the authors develop two learned metrics, SSA-COMET (regression-based MTE) and SSA-COMET-QE (reference-free quality estimation, QE), and assess prompting-based LLMs as evaluators. The results show that SSA-COMET-MTL achieves superior or competitive performance compared with prior metrics and strong LLM baselines, with notable gains on very low-resource languages such as Twi, Luo, and Yorùbá, while offering substantial efficiency advantages. The work emphasizes open data and models to advance reproducibility and regional applicability in African NLP, and it discusses limitations and avenues for future work, including broader domain coverage and fluency/consistency aspects.
Abstract
Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of these issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 14 African language pairs from the News domain, with over 73,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics, respectively. We also benchmark prompting-based approaches using state-of-the-art LLMs such as GPT-4o, Claude-3.7, and Gemini 2.5 Pro. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with Gemini 2.5 Pro, the strongest LLM evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.
