LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

Lukáš Eigler; Jindřich Libovický; David Hurych

Abstract

Validating evaluation metrics for natural language generation (NLG) typically relies on expensive and time-consuming human annotations, which exist predominantly for English datasets. We propose LLM as a Meta-Judge, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, so that the known degradation level stands in for human judgment. We validate our approach using meta-correlation, which measures the alignment between metric rankings derived from synthetic data and those derived from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, with meta-correlations exceeding 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.

Paper Structure (18 sections, 3 equations, 1 figure, 14 tables)

Introduction
Evaluation Metric Validation
LLM as a Meta-Judge
Metric Validation Using Meta-Judge
Meta-Correlation Analysis
Experiments
Tasks
Question Answering.
Summarization.
Machine Translation.
Automatic Metrics
Synthetic Data Generation
Results
Related Work
Conclusions
...and 3 more sections

Figures (1)

Figure 1: LLM as a Meta-Judge contrasted with standard metric validation with human judgment: LLMs generate damaged reference sentences, and we validate the NLG metrics by correlation of the damage level with metric values. We validate our protocol via meta-correlation, i.e., correlation with the standard metric validation.
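The protocol in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the use of Spearman correlation, the sign convention (more damage should lower the score), the metric names, and all numeric values are assumptions made purely for the example.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def synthetic_validity(damage_levels, metric_scores):
    """Synthetic validation of one metric: a good metric should score
    lower as the damage level rises, so the correlation is negated
    (assumed sign convention)."""
    return -spearman(damage_levels, metric_scores)


def meta_correlation(synthetic_validities, human_correlations):
    """Agreement between the metric ranking from synthetic validation
    and the ranking from human-judgment validation."""
    names = sorted(synthetic_validities)
    return spearman(
        [synthetic_validities[n] for n in names],
        [human_correlations[n] for n in names],
    )


# Hypothetical toy data: damage levels of LLM-degraded references and the
# scores three made-up metrics assign to them.
damage = [0, 1, 2, 3, 4]
scores = {
    "metric_a": [0.9, 0.8, 0.6, 0.4, 0.1],
    "metric_b": [0.7, 0.75, 0.5, 0.55, 0.3],
    "metric_c": [0.5, 0.6, 0.4, 0.6, 0.5],
}
# Hypothetical correlations of the same metrics with human judgments.
human = {"metric_a": 0.62, "metric_b": 0.41, "metric_c": 0.10}

synthetic = {name: synthetic_validity(damage, s) for name, s in scores.items()}
meta = meta_correlation(synthetic, human)
```

In this toy setup the synthetic validation and the hypothetical human validation order the three metrics identically, so `meta` is at its maximum; a real study would compare many metrics across several datasets.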
