Table of Contents
Fetching ...

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

Lukáš Eigler, Jindřich Libovický, David Hurych

Abstract

Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose \textit{LLM as a Meta-Judge}, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using \textit{meta-correlation}, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proves to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

Abstract

Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose \textit{LLM as a Meta-Judge}, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using \textit{meta-correlation}, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proves to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.
Paper Structure (18 sections, 3 equations, 1 figure, 14 tables)

This paper contains 18 sections, 3 equations, 1 figure, 14 tables.

Figures (1)

  • Figure 1: LLM as a Meta-Judge contrasted with standard metric validation with human judgment: LLMs generate damaged reference sentences, and we validate the NLG metrics by correlation of the damage level with metric values. We validate our protocol via meta-correlation, i.e., correlation with the standard metric validation.