Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

Amir Hossein Yari, Kalmit Kulkarni, Ahmad Raza Khan, Fajri Koto

TL;DR

ITEM is a large-scale benchmark for evaluating how well automatic metrics agree with human judgments in Machine Translation (MT) and Text Summarization (TS) across six Indian languages. It assesses 26 metrics spanning lexical, embedding, neural, and LLM-based approaches, and tests their robustness to outliers and controlled perturbations. The study finds that LLM-based evaluators align most strongly with human judgments at both segment and system levels. It also reveals task-specific patterns: in TS, metrics are better at capturing content fidelity, while in MT they better reflect fluency. The results offer concrete guidance for designing robust, language-aware evaluation methods and lay a foundation for multilingual metric development.

Abstract

While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 26 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.
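
The distinction between segment-level and system-level agreement mentioned above can be illustrated with a minimal sketch. It assumes Pearson correlation as the agreement measure (the figure captions refer to Pearson correlations, but the abstract does not say which coefficient is used at each level). The record fields `system`, `metric`, and `human` and the toy scores are hypothetical and are not taken from ITEM.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def segment_level(records):
    """Correlate metric and human scores over individual outputs."""
    metric = [r["metric"] for r in records]
    human = [r["human"] for r in records]
    return pearson(metric, human)


def system_level(records):
    """Average scores per system first, then correlate the system means."""
    by_system = defaultdict(list)
    for r in records:
        by_system[r["system"]].append(r)
    metric_means = [sum(r["metric"] for r in rs) / len(rs) for rs in by_system.values()]
    human_means = [sum(r["human"] for r in rs) / len(rs) for rs in by_system.values()]
    return pearson(metric_means, human_means)


# Made-up toy scores, for illustration only (not ITEM data).
records = [
    {"system": "A", "metric": 0.2, "human": 1.0},
    {"system": "A", "metric": 0.4, "human": 2.0},
    {"system": "B", "metric": 0.5, "human": 3.0},
    {"system": "B", "metric": 0.7, "human": 2.0},
    {"system": "C", "metric": 0.8, "human": 4.0},
    {"system": "C", "metric": 0.9, "human": 5.0},
]

print("segment-level r:", round(segment_level(records), 3))
print("system-level r:", round(system_level(records), 3))
```

Because system-level scores are averages, a single extreme segment can shift a system mean and with it the correlation, which is one way outliers can affect metric-human agreement.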

Paper Structure

This paper contains 56 sections, 2 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: End-to-end process of dataset creation.
  • Figure 2: Distribution of human evaluation scores across evaluation aspects.
  • Figure 3: Pearson correlation network of human evaluation aspects.
  • Figure 4: Language-specific Pearson correlations of top metrics across tasks (top-right: MT, bottom-left: TS).
  • Figure 5: Pearson correlation matrices of automatic metrics.
  • ...and 9 more figures