AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

Michael J. Ryan; Yanzhe Zhang; Amol Salunkhe; Yi Chu; Di Xu; Diyi Yang

TL;DR

AutoMetrics is a data-efficient framework for synthesizing evaluation metrics that align with human judgments on subjective AI tasks. It combines metrics retrieved from a curated MetricBank with automatically generated LLM-based criteria, then fits their combination with Partial Least Squares to produce interpretable predictors of quality. Across five tasks, AutoMetrics achieves higher Kendall correlations with human ratings than strong baselines while requiring roughly 80 labeled examples, enabling rapid, adaptive evaluation. The work includes a Tau-Bench case study in which AutoMetrics provides an optimization signal competitive with verifiable rewards, and it releases an open-source toolkit to accelerate evaluation of LLM applications.

Abstract

Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to the same effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
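
To make the "composed via regression" step concrete, the following is a minimal sketch, not the authors' implementation. It assumes three candidate metric scores per example and a synthetic human rating, fits a one-component Partial Least Squares regression (the method named in the TL;DR) in pure Python, and reports Kendall correlation on held-out examples. The data, the function names `kendall_tau_a` and `fit_pls1_one_component`, and the choice of a single component are all illustrative assumptions; the metric retrieval and LLM-generated criteria described above are not modeled here.

```python
import math
import random


def kendall_tau_a(a, b):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    n = len(a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 regression; return (predict function, weight vector)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[k] for row in X) / n for k in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[k] - x_mean[k] for k in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: unit direction in metric space with maximal covariance with y.
    w = [sum(Xc[i][k] * yc[i] for i in range(n)) for k in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Latent scores, then a one-dimensional least-squares fit of y on the scores.
    t = [sum(Xc[i][k] * w[k] for k in range(p)) for i in range(n)]
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(rows):
        return [y_mean + b * sum((r[k] - x_mean[k]) * w[k] for k in range(p)) for r in rows]

    return predict, w


# Synthetic stand-in data (assumption): metric 0 and metric 1 carry signal,
# metric 2 is pure noise, and the "human rating" is a noisy mix of the first two.
rng = random.Random(0)


def make_examples(n):
    X, y = [], []
    for _ in range(n):
        m = [rng.gauss(0, 1) for _ in range(3)]
        X.append(m)
        y.append(2.0 * m[0] + 1.0 * m[1] + 0.3 * rng.gauss(0, 1))
    return X, y


X_train, y_train = make_examples(80)  # roughly the label budget cited in the TL;DR
X_test, y_test = make_examples(40)

predict, weights = fit_pls1_one_component(X_train, y_train)
tau_composite = kendall_tau_a(predict(X_test), y_test)
tau_single = kendall_tau_a([row[1] for row in X_test], y_test)
print(f"composite tau={tau_composite:.3f}, single-metric tau={tau_single:.3f}")
print("metric weights:", [round(v, 3) for v in weights])
```

The weight vector is what makes such a composite interpretable: each entry shows how strongly one underlying metric contributes to the predicted rating. In this synthetic setup, the noise metric should receive a weight near zero.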
