OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

Ivan Kartáč, Mateusz Lango, Ondřej Dušek

TL;DR

OpeNLGauge is an open, reference-free NLG evaluation metric that explains its scores with error-span annotations. It is available either as a two-stage ensemble of open-weight LLMs (OpeNLGauge_ens) or as a distilled 8B model (OpeNLGauge_ft). The distilled model is trained via LoRA on synthetic data covering outputs from a large array of NLG systems, which yields a cost-efficient evaluator that generalizes across domains and aspects. Across seven meta-evaluation datasets, OpeNLGauge achieves competitive correlations with human judgments and superior explainability, outperforming several proprietary-model-based metrics on multiple tasks. The approach emphasizes reproducibility and accessibility for developers and researchers, while acknowledging limitations such as limited multilingual coverage and potential biases in LLM outputs.

Abstract

Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

Paper Structure

This paper contains 56 sections, 22 figures, 21 tables.

Figures (22)

  • Figure 1: The ensemble metric OpeNLGauge$_{ens}$ and its distilled version OpeNLGauge$_{ft}$.
  • Figure 2: Example error span annotation provided by OpeNLGauge for the narrative question answering task. The answer to the question, grounded in the story summary, is evaluated for conciseness.
  • Figure 3: Results of human evaluation of error spans and explanations. Top half of each bar: Error spans marked as correct or incorrect (hallucinated spans, no span provided, or spans without errors). Bottom half: Explanations marked as correct, partial (partially correct or incomplete) or incorrect (not addressing actual errors, vague or incorrect). The differences between TigerScore and OpeNLGauge$_{ens}$ are statistically significant (t-test, $p < 0.05$). See Table \ref{tab:openlgauge} for more details.
  • Figure 4: Ablation results on QAGS and TopicalChat for OpeNLGauge$_{ens}$. Plotted values represent differences in Spearman's $\rho$ correlations with human scores between the ensemble with the original prompt and the corresponding ablation. For TopicalChat, Coh. = coherence, Eng. = engagingness, Gro. = groundedness, Nat. = naturalness, Avg. = average for all aspects.
  • Figure 5: Effect of ensemble size on Spearman's $\rho$ correlations with human scores for the Wiki-DA dataset. Specific model combinations are represented by the colored patches.
  • ...and 17 more figures