Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Sanjeev Kumar; Preethi Jyothi; Pushpak Bhattacharyya

TL;DR

The paper addresses the problem of evaluating machine translation (MT) quality for extremely low-resource languages (ELRLs) by comparing BLEU and ChrF++ on Magahi, Bhojpuri, and Chhattisgarhi, using outputs from both LLM and NMT systems. It finds substantial, task-dependent divergences between the two metrics, illustrates six artifact-related divergence patterns, and argues that ChrF++ on its own can mislead by overemphasizing surface overlap in closely related scripts. In experiments with Aya-101, Airavata, and mT5-Large on the FLORES-200 devtest and NLLB Seed data, the authors show that interpreting BLEU and ChrF++ jointly gives more robust and more interpretable results for ELRL MT evaluation. The study offers practical guidance for diagnosing translation issues (e.g., hallucination, copying, diacritics) and argues that multi-metric evaluation is needed for reliable results in data-scarce settings, with implications for future ELRL benchmarking and methodology.

Abstract

Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
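To make the contrast between word-level and character-level matching concrete, the following is a minimal sketch of the two ingredients the metrics build on: clipped word n-gram precision (the core of BLEU) and a character n-gram F-score (the core of ChrF++). These are simplified stand-ins, not the full metric definitions: there is no brevity penalty, no averaging over n-gram orders, and no word n-gram component in the character score. The function names and the example strings are hypothetical; the strings are invented romanized tokens that mimic a vowel-sign variation and are not taken from the paper's data.

```python
from collections import Counter


def ngrams(seq, n):
    """Count all contiguous n-grams in a sequence (list of words or string of chars)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def word_ngram_precision(hyp, ref, n):
    """Clipped word n-gram precision: a simplified BLEU ingredient (no brevity penalty)."""
    hyp_counts = ngrams(hyp.split(), n)
    ref_counts = ngrams(ref.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def char_ngram_fscore(hyp, ref, n, beta=2.0):
    """Character n-gram F-beta with spaces removed: a simplified chrF-style ingredient."""
    hyp_counts = ngrams(hyp.replace(" ", ""), n)
    ref_counts = ngrams(ref.replace(" ", ""), n)
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp_counts.values())
    recall = matched / sum(ref_counts.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Hypothetical toy pair: two words differ from the reference by a single vowel.
ref = "kitab mez par hai"
hyp = "kitaab mez per hai"

print("word 1-gram precision:", word_ngram_precision(hyp, ref, 1))
print("word 2-gram precision:", word_ngram_precision(hyp, ref, 2))
print("char 2-gram F-score:  ", round(char_ngram_fscore(hyp, ref, 2), 3))
```

On this toy pair, only half of the words match exactly and no word bigram matches, while the character bigram F-score stays above 0.8. This illustrates the general mechanism behind the divergence discussed above: small spelling differences are penalized heavily at the word level but barely register at the character level, which is one reason reading the two scores together can be more informative than reading either alone.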
