AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

Liang Ding

Abstract

LLM-as-Judge evaluation fails on agent tasks because a fixed rubric cannot capture what matters for a given task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on the fly from task descriptions, scoring trajectories step-by-step with confidence-weighted per-dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter - a provably necessary condition for preventing high-scoring dimensions from masking dimension-level failures. On WebArena and ToolBench, ADARUBRIC achieves Pearson $r=0.79$ human correlation (+0.16 over the best static baseline) with deployment-grade reliability (Krippendorff's $\alpha=0.83$). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks; gains transfer to SWE-bench code repair (+4.9 pp) and accelerate PPO convergence by +6.6 pp at 5K steps - both without any rubric engineering. Code: https://github.com/alphadl/AdaRubrics.

Paper Structure

This paper contains 71 sections, 2 theorems, 9 equations, 8 figures, 14 tables.

Key Result

Proposition J.1

Under the noise model (eq:noise), let $\mu_j = \sum_k c_{k,j} s^*_{k,j} / \sum_k c_{k,j}$ be the confidence-weighted true score. The confidence-weighted estimator $\hat{\mu}_j = \sum_k \tfrac{c_{k,j}}{\sum_{k'}c_{k',j}} s_{k,j}$ is the Best Linear Unbiased Estimator (BLUE) for $\mu_j$. Moreover, with equality iff all $c_{k,j}$ are equal.
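
The estimator $\hat{\mu}_j$ in Proposition J.1 is a normalised weighted average of per-step scores, which can be illustrated with a minimal Python sketch. The function name, the score scale, and the sample values below are assumptions made for illustration; they are not taken from the paper or its code release.

```python
from typing import Sequence


def confidence_weighted_score(
    scores: Sequence[float], confidences: Sequence[float]
) -> float:
    """Compute sum_k (c_k / sum_k' c_k') * s_k for a single rubric dimension j."""
    if len(scores) != len(confidences):
        raise ValueError("scores and confidences must have the same length")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    return sum(c * s for c, s in zip(confidences, scores)) / total


# Hypothetical per-step scores s_{k,j} and judge confidences c_{k,j}.
step_scores = [4.0, 2.0, 5.0]
step_confidences = [0.5, 0.25, 0.25]

weighted = confidence_weighted_score(step_scores, step_confidences)
# With equal confidences the estimator reduces to the plain mean.
uniform = confidence_weighted_score(step_scores, [1.0, 1.0, 1.0])
```

With these sample values the weighted estimate leans towards the first step, which carries half of the total confidence, while equal confidences recover the unweighted mean.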

Figures (8)

  • Figure 1: Static evaluation vs. AdaRubric. Static LLM-as-Judge applies identical dimensions to all tasks, yielding weak human correlation ($r\approx 0.46$). AdaRubric synthesises task-specific rubrics from the task description, achieving $r\approx 0.77$. Pearson $r$ averaged over 300 held-out trajectory pairs per benchmark.
  • Figure 2: AdaRubric pipeline. Stage 1 synthesises a task-adaptive rubric. Stage 2 evaluates trajectories step-by-step with confidence weights. Stage 3 applies composable filters. The reward synthesis branch generates margin-gated DPO preference pairs.
  • Figure 3: Complete AdaRubric pipeline. All three stages are modular; any LLM can serve as $\mathcal{M}$.
  • Figure 4: Human correlation comparison. AdaRubric-DA achieves $r{=}0.79$ / $0.74$ on WebArena / ToolBench (highlighted row; large dot markers at bar tips). Dashed line = GPT-4 Direct baseline.
  • Figure 5: DPO training quality vs. number of pairs. AdaRubric-DA consistently outperforms all baselines across data regimes; diminishing returns appear beyond 6K pairs. Ada-WM uses weighted-mean aggregation; Random and Prometheus are baselines.
  • ...and 3 more figures
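
The Figure 2 caption mentions margin-gated DPO preference pairs. One plausible reading is that two trajectories for the same task form a (chosen, rejected) pair only when their aggregate scores differ by at least some margin. The sketch below illustrates that reading; the `Scored` type, the function name, the margin value, and the sample scores are all hypothetical, and the paper's actual gating rule may differ.

```python
from itertools import combinations
from typing import List, NamedTuple, Sequence, Tuple


class Scored(NamedTuple):
    trajectory_id: str
    score: float  # aggregate rubric score for one trajectory (assumed scale)


def margin_gated_pairs(
    scored: Sequence[Scored], margin: float
) -> List[Tuple[str, str]]:
    """Return (chosen_id, rejected_id) pairs whose score gap is at least `margin`."""
    if margin <= 0:
        raise ValueError("margin must be positive so that ties never form a pair")
    pairs = []
    for a, b in combinations(scored, 2):
        if abs(a.score - b.score) >= margin:
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append((chosen.trajectory_id, rejected.trajectory_id))
    return pairs


# Hypothetical trajectories for a single task.
candidates = [Scored("t1", 4.5), Scored("t2", 3.0), Scored("t3", 4.0)]
pairs = margin_gated_pairs(candidates, margin=1.0)
```

Here the pair (t1, t3) is dropped because its gap of 0.5 falls below the margin, so only comparisons with a clear quality difference would reach DPO training.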

Theorems & Definitions (6)

  • Definition 3.1: Task-Adaptive Rubric
  • Remark 1
  • Proposition J.1: BLUE of Confidence-Weighted Aggregation
  • proof
  • Proposition J.2: Masking-Prevention Separation
  • proof
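
The masking effect that Proposition J.2 addresses can be illustrated with a small Python sketch: an aggregate-score filter can accept a trajectory that fails badly on one dimension, whereas a per-dimension check rejects it. The function names, thresholds, the third dimension name, and the scores are assumptions for illustration; the source does not spell out the exact rule used by DimensionAwareFilter.

```python
from typing import Dict


def weighted_mean(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Aggregate per-dimension scores into a single weighted mean."""
    total = sum(weights[d] for d in scores)
    return sum(weights[d] * scores[d] for d in scores) / total


def passes_mean_filter(
    scores: Dict[str, float], weights: Dict[str, float], threshold: float
) -> bool:
    """Aggregate-only filter: strong dimensions can hide a weak one."""
    return weighted_mean(scores, weights) >= threshold


def passes_dimension_aware_filter(
    scores: Dict[str, float], min_per_dimension: float
) -> bool:
    """Per-dimension filter (assumed form): every dimension must clear the floor."""
    return all(s >= min_per_dimension for s in scores.values())


# Hypothetical scores on a 1-5 scale; "Efficiency" is an illustrative dimension.
scores = {"Correctness": 5.0, "Error Handling": 1.0, "Efficiency": 5.0}
weights = {"Correctness": 1.0, "Error Handling": 1.0, "Efficiency": 1.0}

mean_ok = passes_mean_filter(scores, weights, threshold=3.5)
dim_ok = passes_dimension_aware_filter(scores, min_per_dimension=2.0)
```

In this example the weighted mean is about 3.67, so the aggregate filter accepts the trajectory even though Error Handling scored 1.0; the per-dimension check rejects it.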