
Autorubric: A Unified Framework for Rubric-Based LLM Evaluation

Delip Rao, Chris Callison-Burch

TL;DR

We propose Autorubric, an open-source Python framework that pairs reliability metrics drawn from psychometrics with production infrastructure: response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking.

Abstract

Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's $κ$, weighted $κ$, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, demonstrating that it produces results consistent with published benchmarks while exercising the framework's key capabilities: per-criterion binary evaluation with few-shot calibration (RiceChem), multi-judge ensemble evaluation across judge models (ResearcherBench), and mixed criterion types combining binary, ordinal, and nominal scales (CHARM-100). We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.
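Cohen's $κ$, the first reliability metric named above, corrects raw agreement between two raters for the agreement expected by chance: $κ = (p_o - p_e) / (1 - p_e)$. The following is a minimal sketch of that standard formula in plain Python. It is not Autorubric's API; the function name `cohens_kappa` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement p_o: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement p_e: sum over labels of the product of marginal rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined
        # (0/0), and this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (invented for illustration): LLM judge verdicts vs. human labels.
judge = ["met", "met", "unmet", "met", "unmet", "unmet", "met", "unmet"]
human = ["met", "unmet", "unmet", "met", "unmet", "met", "met", "unmet"]
kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 6 of 8 items ($p_o = 0.75$) and each uses both labels equally often ($p_e = 0.5$), so $κ = 0.5$.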

Paper Structure (34 sections, 3 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: Autorubric evaluation pipeline. A RubricDataset packages a task prompt, a Rubric (weighted criteria), and DataItems (submissions to evaluate). The EvalRunner iterates over items, delegating each to Rubric.grade(), which dispatches to the CriterionGrader. The grader issues $N \times M$ independent LLM calls ($N$ judges, $M$ criteria) in parallel via asyncio.gather(), where each call evaluates one criterion under one judge. A single-judge configuration is treated as an ensemble of one, which unifies the code path. Criteria may be binary (met/unmet) or multi-choice with option shuffling to mitigate position bias. Per-criterion votes are then combined by a configurable aggregation strategy (majority vote, weighted vote, mean, or mode) to produce an EnsembleEvaluationReport. Three cross-cutting concerns operate transparently: disk-based response caching avoids redundant LLM calls, per-provider rate limiting controls concurrency, and checkpoint files (manifest + JSONL) enable resumable evaluation runs.
  • Figure 2: Design space for rubric-based LLM evaluation. Each row represents one of the five dimensions examined in this section: evaluation mode (§\ref{sec:eval-modes}), rubric structure (§\ref{sec:rubric-structure}), judging strategy (§\ref{sec:judging-strategies}), calibration (§\ref{sec:calibration}), and reasoning (§\ref{sec:reasoning-enhanced}). Green pills indicate Autorubric defaults, blue pills indicate supported options, and gray dashed pills indicate paradigms covered in this paper but not yet implemented in the framework. Sub-groups within a dimension are separated by dashed lines; sub-notes indicate additional configuration (e.g., verdict balancing for few-shot calibration, budget levels for extended thinking).
  • Figure 3: An illustration of the length penalty function in Autorubric with varying exponents and penalty values to counteract verbosity bias. The penalty increases from zero once output tokens exceed the free budget, following a power curve up to the maximum cap (\ref{eq:length-penalty}).
  • Figure 4: Few-shot accuracy vs. cost on the RiceChem dataset. Although it is not apparent in the plot, cost grows sub-linearly due to prompt caching.
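
The fan-out described in the Figure 1 caption ($N$ judges times $M$ criteria, dispatched with asyncio.gather() and combined by majority vote) can be sketched as follows. This is an illustrative sketch only, not Autorubric's implementation: `stub_judge`, `grade`, the judge names, and the keyword-matching rule are all hypothetical stand-ins for real LLM calls.

```python
import asyncio


async def stub_judge(judge_name, criterion, submission):
    """Hypothetical stand-in for one LLM call: one judge, one criterion."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    # Toy rule (assumption): a criterion is "met" if its keyword appears.
    verdict = criterion in submission
    # Make one judge dissent so the vote is not trivially unanimous.
    if judge_name == "judge-c":
        verdict = not verdict
    return criterion, verdict


async def grade(submission, judges, criteria):
    """Issue N x M independent calls concurrently, then majority-vote per criterion."""
    calls = [stub_judge(j, c, submission) for j in judges for c in criteria]
    results = await asyncio.gather(*calls)  # results keep the order of `calls`

    # Group the binary votes by criterion.
    votes = {c: [] for c in criteria}
    for criterion, verdict in results:
        votes[criterion].append(verdict)

    # Strict majority: met only if more than half of the judges say so.
    return {c: sum(v) > len(v) / 2 for c, v in votes.items()}


judges = ["judge-a", "judge-b", "judge-c"]
criteria = ["equilibrium", "entropy"]
report = asyncio.run(
    grade("The answer explains equilibrium only.", judges, criteria)
)
print(report)
```

Passing a one-element `judges` list exercises the same code path, which mirrors the caption's point that a single judge is treated as an ensemble of one.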