LLM-as-a-Judge for Scalable Test Coverage Evaluation: Accuracy, Operational Reliability, and Cost

Donghao Huang, Shila Chew, Anna Dutkiewicz, Zhaoxia Wang

TL;DR

This paper addresses scalable, semantic evaluation of software test coverage by introducing LLM-as-a-Judge (LAJ), a rubric-driven framework that scores Gherkin acceptance tests and returns structured JSON. It evaluates 20 model configurations on 100 expert-annotated scripts over 5 runs and finds that smaller models such as GPT-4o Mini can outperform larger models in both accuracy and reliability at substantially lower cost. The study introduces the Evaluation Completion Rate (ECR@1) and adjusted cost metrics to capture deployment realities, and gives production guidance favoring low-cost, high-reliability configurations. The authors release data, framework, and code to support adoption in CI/CD QA pipelines and reproducibility across domains.

Abstract

Assessing software test coverage at scale remains a bottleneck in QA pipelines. We present LLM-as-a-Judge (LAJ), a production-ready, rubric-driven framework for evaluating Gherkin acceptance tests with structured JSON outputs. Across 20 model configurations (GPT-4, GPT-5 with varying reasoning effort, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), we provide the first comprehensive analysis spanning accuracy, operational reliability, and cost. We introduce the Evaluation Completion Rate (ECR@1) to quantify first-attempt success, revealing reliability from 85.4% to 100.0% with material cost implications via retries. Results show that smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE), high reliability (96.6% ECR@1), and low cost ($1.01 per 1K), yielding a 78x cost reduction vs. GPT-5 (high reasoning) while improving accuracy. Reasoning effort is model-family dependent: GPT-5 benefits from increased reasoning (with predictable accuracy-cost tradeoffs), whereas open-weight models degrade across all dimensions as reasoning increases. Overall, cost spans 175x ($0.45-$78.96 per 1K). We release the dataset, framework, and code to support reproducibility and deployment.
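
As a rough illustration of the arithmetic behind these figures, the sketch below computes a first-attempt completion rate and reproduces the quoted cost ratios. It is not the authors' implementation: the function names are hypothetical, the outcome counts are reconstructed from the reported 96.6% over 500 evaluations, and the retry model (independent retries until success, so expected attempts equal 1 / ECR@1) is an assumption rather than the paper's stated definition of adjusted cost.

```python
from typing import Sequence


def ecr_at_1(first_attempt_ok: Sequence[bool]) -> float:
    """Share of evaluations that completed on the first attempt."""
    if not first_attempt_ok:
        raise ValueError("need at least one evaluation outcome")
    return sum(first_attempt_ok) / len(first_attempt_ok)


def retry_adjusted_cost(cost_per_1k: float, ecr: float) -> float:
    """Cost per 1K after retries.

    Assumption (not from the paper): each failed attempt is retried
    independently until it succeeds, so expected attempts = 1 / ecr.
    """
    if not 0.0 < ecr <= 1.0:
        raise ValueError("ecr must be in (0, 1]")
    return cost_per_1k / ecr


# 100 scripts x 5 runs = 500 evaluations; 483 first-attempt successes
# reproduces the 96.6% ECR@1 reported for GPT-4o Mini.
outcomes = [True] * 483 + [False] * 17
ecr = ecr_at_1(outcomes)

# Reported cost of $1.01 per 1K, scaled under the assumed retry model.
adjusted = retry_adjusted_cost(1.01, ecr)

# Ratios quoted in the abstract, recomputed from the reported costs.
ratio_vs_gpt5_high = 78.96 / 1.01   # the "78x" reduction
overall_spread = 78.96 / 0.45       # the "175x" span

print(f"ECR@1 = {ecr:.3f}")
print(f"retry-adjusted cost per 1K = ${adjusted:.2f}")
print(f"cost ratio vs GPT-5 (high) = {ratio_vs_gpt5_high:.1f}x")
print(f"overall cost spread = {overall_spread:.1f}x")
```

Under this assumed model, a configuration at the low end of the reported reliability range (85.4% ECR@1) would need roughly 1.17 attempts per completed evaluation, which is one way retries can have material cost implications.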

Paper Structure

This paper contains 38 sections, 9 equations, 1 table.