Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation

Mingqi Gao, Xinyu Hu, Li Lin, Xiaojun Wan

TL;DR

This paper examines how the choice of correlation measure, defined by a grouping method and a correlation coefficient, affects meta-evaluation in NLG. It analyzes 12 correlation measures across six real-world datasets and 32 metrics (including LLM-based evaluators), and introduces three perspectives for comparing them: discriminative power, ranking consistency, and sensitivity to score granularity. The authors find that global-level grouping combined with Pearson $r$ performs best in both discriminative power and ranking consistency, while measures using system-level grouping or Kendall $\tau$ are the least sensitive to score granularity. The work provides empirical guidance for selecting correlation measures in NLG meta-evaluation and shows that this methodological choice changes how metrics are ranked.

Abstract

The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients result in various types of correlation measures used in meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics of these measures and the differences between them have not received sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely used NLG evaluation datasets and 32 evaluation metrics, revealing that different measures do affect the meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of meta-evaluation: discriminative power, ranking consistency, and sensitivity to score granularity. We find that the measure using global grouping and the Pearson correlation coefficient exhibits the best performance in both discriminative power and ranking consistency. In addition, the measures using system-level grouping or Kendall correlation are the least sensitive to score granularity.
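To make the notion of a "correlation measure" concrete, the following is a minimal sketch of how a grouping method and a coefficient combine. It covers only the two groupings and two coefficients named in the abstract. The toy scores, the function names (`global_level`, `system_level`), the use of per-system means for system-level grouping, and the choice of the tau-b variant of Kendall's coefficient are all illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson r between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(x, y):
    """Kendall tau-b (assumed variant), which adjusts for tied pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_y_only) * (untied + tied_x_only))
    return (concordant - discordant) / denom


def global_level(metric, human, coef):
    """Pool every (system, input) score into one vector, then correlate."""
    systems = sorted(human)
    m = [s for name in systems for s in metric[name]]
    h = [s for name in systems for s in human[name]]
    return coef(m, h)


def system_level(metric, human, coef):
    """Average each system's scores over inputs, then correlate the means."""
    systems = sorted(human)
    m = [sum(metric[name]) / len(metric[name]) for name in systems]
    h = [sum(human[name]) / len(human[name]) for name in systems]
    return coef(m, h)


# Hypothetical toy data: 3 systems x 4 inputs, human ratings on a 1-5 scale.
human_scores = {
    "sysA": [1, 2, 2, 3],
    "sysB": [3, 3, 4, 4],
    "sysC": [4, 5, 5, 5],
}
metric_scores = {
    "sysA": [0.2, 0.5, 0.1, 0.4],
    "sysB": [0.6, 0.3, 0.7, 0.5],
    "sysC": [0.6, 0.9, 0.7, 0.8],
}

if __name__ == "__main__":
    for group_name, group in (("global", global_level), ("system", system_level)):
        for coef_name, coef in (("pearson", pearson), ("kendall", kendall_tau_b)):
            value = group(metric_scores, human_scores, coef)
            print(f"{group_name:>6} + {coef_name:<7}: {value:.3f}")
```

On this toy data the metric orders the three systems exactly as the humans do, so the system-level Kendall value is 1, while the global-level value is lower because disagreements on individual outputs are also counted. The same metric can therefore look better or worse depending on the measure chosen, which is the effect the paper studies at scale.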

Paper Structure

This paper contains 25 sections, 3 equations, 51 figures, 7 tables, 3 algorithms.

Figures (51)

  • Figure 1: The consistency of evaluation metric rankings using different correlation measures on SummEval, calculated through Kendall's correlation coefficient.
  • Figure 2: DP values of different correlation measures on all meta-evaluation datasets using the permutation test (lower is better). Each column "Dn" shows the result on one dataset, which corresponds to the original dataset as shown in Table \ref{tab:subdata_parameter}. The first column presents the overall performance, averaged over all datasets.
  • Figure 3: RC values of different correlation measures on all meta-evaluation datasets (higher is better), with columns laid out as in Figure 2.
  • Figure 4: Correlations between the GPT-4-Turbo evaluator and human evaluation under different measures as $G^m$ varies, on SummEval (left) and WMT23 (right), with the evaluation scale fixed at 1-5.
  • Figure 5: Correlations between metrics and humans under different measures as $G^m$ varies, in a statistical simulation with $G^h=13$.
  • ...and 46 more figures