Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy

Athiya Deviyani, Fernando Diaz

TL;DR

This work introduces contextual metric meta-evaluation: it defines local metric accuracy and estimates it via perturbation-based pairwise comparisons, making it possible to assess how metric reliability varies across evaluation contexts in machine translation (MT), automatic speech recognition (ASR), and ranking. The local accuracies Acc_{\mu}(Q_x) and Acc_{\mu}(Q) quantify a metric's ability to reproduce true orderings under context-specific perturbations, and hypothesis tests show absolute context-dependent shifts and, in some tasks, relative ones. Across MT, ASR, and ranking, the study finds that both the magnitude of local accuracy and the ranking of candidate metrics vary by context, which argues for context-aware metric selection and deployment. The results yield practical guidelines for diagnostic metric evaluation, highlight limitations of global meta-evaluation, and advocate adaptive metric strategies aligned with development stage and domain.

Abstract

Meta-evaluation of automatic evaluation metrics -- assessing evaluation metrics themselves -- is crucial for accurately benchmarking natural language processing systems and has implications for scientific inquiry, production model development, and policy enforcement. While existing approaches to metric meta-evaluation focus on general statements about the absolute and relative quality of metrics across arbitrary system outputs, in practice, metrics are applied in highly contextual settings, often measuring the performance for a highly constrained set of system outputs. For example, we may only be interested in evaluating a specific model or class of models. We introduce a method for contextual metric meta-evaluation by comparing the local metric accuracy of evaluation metrics. Across translation, speech recognition, and ranking tasks, we demonstrate that the local metric accuracies vary both in absolute value and relative effectiveness as we shift across evaluation contexts. This observed variation highlights the importance of adopting context-specific metric evaluations over global ones.
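
The idea of local metric accuracy can be illustrated with a minimal sketch. It assumes, since this overview does not give the formal definition, that accuracy is the fraction of (better, worse) output pairs within one evaluation context for which the metric scores the better output strictly higher. The names `local_metric_accuracy` and `token_overlap`, the toy sentences, and the treatment of ties as errors are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import Callable, Sequence, Tuple

# A pair is (better output, worse output); the true ordering is assumed known,
# e.g. because the worse output was produced by perturbing the better one.
Pair = Tuple[str, str]


def local_metric_accuracy(
    metric: Callable[[str, str], float],
    reference: str,
    pairs: Sequence[Pair],
) -> float:
    """Fraction of pairs where the metric ranks the better output strictly higher.

    Ties count as errors here; that is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("need at least one pair to estimate accuracy")
    correct = sum(
        1
        for better, worse in pairs
        if metric(better, reference) > metric(worse, reference)
    )
    return correct / len(pairs)


def token_overlap(hypothesis: str, reference: str) -> float:
    """Toy metric (hypothetical): share of reference tokens found in the hypothesis."""
    ref_tokens = reference.split()
    hyp_tokens = set(hypothesis.split())
    return sum(token in hyp_tokens for token in ref_tokens) / len(ref_tokens)


# One evaluation context: a single reference with three perturbed pairs.
reference = "the cat sat on the mat"
pairs = [
    # Dropping a repeated word is invisible to the toy metric, so this pair ties.
    ("the cat sat on the mat", "the cat sat on mat"),
    # Substituting a content word lowers the overlap score.
    ("the cat sat on the mat", "the dog sat on the mat"),
    # Deleting several words lowers the overlap score.
    ("the cat sat on a mat", "the cat on a"),
]

accuracy = local_metric_accuracy(token_overlap, reference, pairs)
print(f"local accuracy in this context: {accuracy:.3f}")
```

Running the same function on pairs drawn from a different context (another system, speaker, or quality band) and comparing the resulting accuracies is the kind of contextual comparison the abstract describes.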

Paper Structure

This paper contains 30 sections, 2 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Machine Translation. Metric accuracy for machine translation metrics across the different systems.
  • Figure 2: Automatic Speech Recognition. Metric accuracy for automatic speech recognition metrics across the different Speaker IDs. (a) Speaker IDs to the left of the gray line come from the Quality=Clean LibriSpeech-100 dataset, while the Speaker IDs to the right come from the Quality=Other LibriSpeech-100 dataset.
  • Figure 3: Ranking. Metric accuracy for ranking metrics across the different systems.
  • Figure 4: Local metric accuracy across the different MQM scores for English to German (En-De) translation pairs
  • Figure 5: Local metric accuracy variance between the different number of perturbation combinations.
  • ...and 6 more figures