LaajMeter: A Framework for LaaJ Evaluation

Samuel Ackerman, Gal Amram, Ora Nova Fandina, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Avi Ziv

TL;DR

This work tackles the problem of evaluating LLM-based evaluators (LaaJs) in domain-specific NLP tasks where annotated data are scarce. It introduces LaaJMeter, a simulation-based framework that generates synthetic virtual models and LaaJs to enable controlled meta-evaluation of evaluation metrics and thresholds without external ground truth. Through a code-translation use case (COBOL to Java) on IBM WCA4Z, the authors compare metrics such as the $t$-test, Kendall-$\tau$ correlation, and an ordering experiment, finding that the $t$-test can be misleading while Kendall-$\tau$ provides a robust signal even with approximate distance estimates (e.g., $\tau \approx 0.70$). The study demonstrates that distance-aware interpretation is essential for reliable meta-evaluation and positions LaaJMeter as a scalable tool for trustworthy and reproducible evaluation in low-resource settings.

Abstract

Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). The analysis of LaaJ software, commonly referred to as meta-evaluation, poses significant challenges in domain-specific contexts. In such domains, in contrast to general domains, annotated data is scarce and expert evaluation is costly. As a result, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. It therefore becomes difficult to determine which metrics effectively identify LaaJ quality and, further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. This helps practitioners validate LaaJs for specific tasks: they can test whether their metrics correctly distinguish between high- and low-quality (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.

Paper Structure

This paper contains 21 sections and 3 tables.