OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
Ivan Kartáč, Mateusz Lango, Ondřej Dušek
TL;DR
OpeNLGauge presents an open, reference-free NLG evaluation metric that provides precise error-span explanations by leveraging a two-stage ensemble of open-weight LLMs and a distilled 8B model. The framework uses synthetic data from a large array of NLG systems to train a cost-efficient evaluator (OpeNLGauge_ft) via LoRA, enabling robust cross-domain and cross-aspect generalization. Across seven meta-evaluation datasets, OpeNLGauge achieves competitive correlations with human judgments and superior explainability, outperforming several proprietary-model-based metrics on multiple tasks. The approach emphasizes reproducibility and accessibility, demonstrating practical impact for developers and researchers, while acknowledging limitations such as limited multilingual coverage and potential biases in LLM outputs.
Abstract
Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.
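To make "correlation with human judgments" concrete, the following is a minimal sketch of how a segment-level meta-evaluation score could be computed. The abstract does not state which correlation coefficient is used, so Spearman's rank correlation is an assumption here. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run [start, end] over all items tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical toy data: one metric score and one human rating per output.
    metric_scores = [4.0, 2.5, 3.0, 1.0, 5.0]
    human_scores = [4.0, 2.0, 2.0, 1.0, 5.0]
    print(f"Spearman correlation: {spearman(metric_scores, human_scores):.3f}")
```

A higher coefficient means the metric orders system outputs more like the human annotators do. In practice, a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written functions above.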
