BMX: Boosting Natural Language Generation Metrics with Explainability

Christoph Leiter, Hoa Nguyen, Steffen Eger

TL;DR

BMX turns explanations of opaque NLG evaluation metrics into a model-agnostic performance boost. The method computes word-level importance scores for a base segment-level metric, aggregates them with a power mean into a second segment-level score, and linearly combines that score with the original one, which also allows iterative refinement. BMX improves multiple metrics across MT and summarization datasets: gains are small for machine translation and strong for summarization, with significant gains on several MQM and SummEval splits. Notably, with the LIME explainer and preselected parameters, it achieves an average improvement of 0.087 points in system-level Spearman correlation on SummEval. The framework can be applied to a broad class of NLG metrics and potentially to other regression and classification tasks, but its performance depends on careful parameter calibration across domains and on the choice of explainer.

Abstract

State-of-the-art natural language generation evaluation metrics are based on black-box language models. Hence, recent works consider their explainability with the goals of better understandability for humans and better metric analysis, including failure cases. In contrast, our proposed method BMX: Boosting Natural Language Generation Metrics with explainability explicitly leverages explanations to boost the metrics' performance. In particular, we perceive feature importance explanations as word-level scores, which we convert, via power means, into a segment-level score. We then combine this segment-level score with the original metric to obtain a better metric. Our tests show improvements for multiple metrics across MT and summarization datasets. While improvements in machine translation are small, they are strong for summarization. Notably, BMX with the LIME explainer and preselected parameters achieves an average improvement of 0.087 points in Spearman correlation on the system-level evaluation of SummEval.
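
The aggregation and combination steps described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `power_mean` and `bmx_combine` are hypothetical, the word-level scores are assumed to be strictly positive (the page does not say how negative importances are handled), and the weighting convention is taken from the Figure 4 caption, where $w=1$ recovers the original metric.

```python
import math
from typing import Sequence


def power_mean(values: Sequence[float], p: float) -> float:
    """Generalized (power) mean of strictly positive values.

    Assumption for this sketch: all values are > 0, so the mean is
    defined for negative p and for the p = 0 limit.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    n = len(values)
    if p == 0:
        # The limit p -> 0 of the power mean is the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def bmx_combine(original_score: float,
                word_scores: Sequence[float],
                w: float,
                p: float) -> float:
    """Linearly combine the original segment-level score with the
    power mean of the word-level scores.

    w = 1 returns the original metric unchanged; w = 0 uses only the
    aggregated explanation score.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * original_score + (1.0 - w) * power_mean(word_scores, p)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    word_scores = [0.9, 0.8, 0.2, 0.7]
    for p in (-30, -1, 0, 1, 30):
        print(p, bmx_combine(0.75, word_scores, w=0.5, p=p))
```

Strongly negative values of $p$ pull the power mean toward the smallest word-level score, and strongly positive values pull it toward the largest, so $p$ controls how much a single low-scoring word affects the combined segment-level score.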

Paper Structure

This paper contains 40 sections, 5 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The duality of segment-level natural language generation evaluation metrics (right) and their word-level explanations (left).
  • Figure 2: Box plots of the $w$ and $p$ values for XBERTScore that lead to improvements with different explainers across all settings of the MT calibration sets. Md denotes the median.
  • Figure 3: System-level correlation with BARTScore on the first test split of SummEval. Left columns show the original correlation, middle columns show the correlation with BARTScore fine-tuned on the calibration set and right columns show the correlation with BMX.
  • Figure 4: System-level correlation with BERTScore on RealSumm, across $p$ values from $-30$ to $30$ and across $w$ values from $0$ to $1$, where $w=1$ is the original metric (indicated by a black line). BMX uses LIME in this sample.
  • Figure 5: Average Pearson correlation between three repeated runs of BMX with LIME, with different settings of $w$ on the x-axis. The tests were computed on three language pairs from WMT22, and the values of $p$ range from $-30$ to $30$ for every $w$ setting.
  • ...and 4 more figures