AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

Michael J. Ryan, Yanzhe Zhang, Amol Salunkhe, Yi Chu, Di Xu, Diyi Yang

TL;DR

AutoMetrics presents a data-efficient framework for synthesizing evaluation metrics that align with human judgments on subjective AI tasks. It combines a curated MetricBank with automatically generated LLM-based criteria and optimizes their combination via Partial Least Squares to produce interpretable predictors of quality. Across five tasks, AutoMetrics achieves higher Kendall correlations with human ratings than strong baselines and requires roughly 80 labeled examples, enabling rapid, adaptive evaluation. The work includes a Tau-Bench case study showing competitive optimization signals compared to verifiable rewards and releases an open-source toolkit to accelerate evaluation of LLM applications.

Abstract

Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with the human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward with an effect equal to that of a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.

Paper Structure

This paper contains 50 sections, 2 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: AutoMetrics takes you from expensive measures to interpretable automatic metrics. Here AutoMetrics generates useful metrics for evaluating LLM-written product descriptions based on user reviews, from EvalGen (doi:10.1145/3654777.3676450). Percentages indicate the relative importance of each metric, derived from regression coefficients.
  • Figure 2: AutoMetrics comprises four steps. (1) Generate: create task-specific candidate metrics (Single criteria, Rubric, Examples, MIPROv2). (2) Retrieve: from the generated candidates plus MetricBank, use ColBERT to prefilter to $k'$ metric cards and an LLM to select the final $k$. (3) Regress: fit a PLS model on the training set to weight and select metrics that predict human judgments. (4) Report: produce a writeup with weights, correlations, and details to guide adoption.
  • Figure 3: Sensitivity/Stability of AutoMetrics for SimpEval, HelpSteer2, and CoGym. AutoMetrics are sensitive to negative perturbations and stable on neutral perturbations.
  • Figure 4: All correlations plotted for various training set sizes with "Generated Only" and "Full" Metric Banks. Individual trials are translucent while average performance at a scale is solid.
  • Figure 5: AutoMetrics produces three metrics for $\tau$-Bench. Regression coefficients in yellow.
  • ...and 3 more figures
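
The Regress and Report steps in the Figure 2 caption, and the relative importance percentages in the Figure 1 caption, can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified one-component PLS1 fit written with the standard library only. The metric names and scores are hypothetical toy data. Computing importance as normalized absolute coefficients is an assumption about how the percentages might be derived. Kendall's tau-a is used here for simplicity, and the paper may use a different tau variant.

```python
from math import sqrt

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs. Ties count as neither."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

def standardize(col):
    """Center a column and scale it to unit (population) standard deviation."""
    mean = sum(col) / len(col)
    std = sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / std for v in col]

def fit_pls1(columns, y):
    """One-component PLS1 on standardized metric columns.

    Returns (beta, preds): per-metric coefficients on the standardized
    scale and the fitted predictions for the training rows.
    """
    z = [standardize(c) for c in columns]
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    # Weight vector: covariance direction between each metric and the target.
    w = [sum(zi * yi for zi, yi in zip(col, yc)) for col in z]
    norm = sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Latent score per example, then regress the centered target on it.
    t = [sum(w[j] * z[j][i] for j in range(len(z))) for i in range(len(y))]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    beta = [wj * q for wj in w]
    preds = [y_mean + q * ti for ti in t]
    return beta, preds

# Hypothetical toy data: three candidate metrics scored on six outputs.
metric_scores = {
    "clarity_judge": [0.1, 0.2, 0.35, 0.5, 0.6, 0.9],
    "length_ratio": [3, 1, 4, 2, 6, 5],
    "keyword_overlap": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],
}
human = [1, 2, 3, 4, 5, 6]  # hypothetical human ratings

beta, preds = fit_pls1(list(metric_scores.values()), human)

# Assumed definition of relative importance: share of absolute coefficient mass.
total = sum(abs(b) for b in beta)
importance = {name: abs(b) / total for name, b in zip(metric_scores, beta)}

tau = kendall_tau_a(preds, human)
for name, share in importance.items():
    print(f"{name}: {share:.1%}")
print(f"Kendall tau-a vs. human ratings: {tau:.3f}")
```

In practice a library implementation with several components and held-out evaluation would be preferable. The single-component version is shown only because it makes the link between regression coefficients and per-metric weights easy to read.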