A Rubric-Supervised Critic from Sparse Real-World Outcomes

Xingyao Wang; Valerie Chen; Heng Ji; Graham Neubig

TL;DR

This paper introduces Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Critics trained with it improve best-of-N reranking on SWE-bench and support training-time data curation via critic-selected trajectories.

Abstract

Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
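The two inference-time uses named in the abstract, best-of-N reranking and early stopping, can be illustrated with a minimal sketch. It assumes the critic is any callable that maps a trajectory to a scalar success score; the function names, the `threshold` parameter, and the stopping rule are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Critic = Callable[[str], float]  # assumed interface: trajectory -> predicted success score


def best_of_n(trajectories: List[str], critic: Critic) -> str:
    """Rerank N candidate trajectories and return the highest-scoring one.

    Ties are broken in favor of the earliest candidate.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    return max(trajectories, key=critic)


def early_stop(
    attempts: Iterable[str], critic: Critic, threshold: float
) -> Tuple[Optional[str], int]:
    """Consume attempts one at a time and stop once a score reaches `threshold`.

    Returns the best trajectory seen so far and the number of attempts used.
    The threshold rule is an illustrative assumption, not the paper's policy.
    """
    best: Optional[str] = None
    best_score = float("-inf")
    used = 0
    for trajectory in attempts:
        used += 1
        score = critic(trajectory)
        if score > best_score:
            best, best_score = trajectory, score
        if score >= threshold:
            break  # good enough: skip the remaining attempts
    return best, used
```

With a toy critic such as a dictionary lookup, `early_stop` returns as soon as one attempt clears the threshold, which is where the savings in attempts come from; if no attempt clears it, the function falls back to plain best-of-N over everything it saw.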

Paper Structure (38 sections, 6 equations, 3 figures, 9 tables)

Introduction
Data: Modeling Interactions as Segments
Supervision in Verified-Reward Benchmarks
Representing Trajectories as Segments
Assigning Indirect Outcome Signals to Segments
Critic Rubrics
Rubric Design
Annotation Methodology
Rubric Effect Analysis
Critic Model Evaluation
Experimental Setup
Benchmark-Trained Critics Do Not Transfer to Real-world Data
Code Survival Provides More Fine-Grained Supervision
Effect on Down-stream Task Performance
Critics Enable Inference-Time Scaling
...and 23 more sections

Figures (3)

Figure 1: Overview of our method: Learning a deployable critic from production traces. We convert real-world human--agent interactions into segments (user request $\rightarrow$ agent actions $\rightarrow$ finish), annotate each segment with trace-observable Critic Rubrics (24 dense behavioral signals), and combine them with sparse production outcome proxies (e.g., PR merge / code survival) to train a semi-supervised, multi-task critic that predicts both rubric features and segment success. The resulting critic supports best-of-$K$ reranking, compute-efficient early stopping, and trajectory selection for training-time data curation.
Figure 2: From sparse outcomes to dense feedback in real-world usage. A pull request (PR) provides a coarse outcome signal (merged or not). Each PR contains commits, which we attribute to segments---self-contained units of agent work within multi-turn conversations (§\ref{sec:segment}). This hierarchy grounds supervision at multiple granularities: PR-merge labels apply to all segments linked to the PR, while code survival assigns fine-grained credit based on how much segment-authored code remains in the final diff. Critic Rubrics provide dense, outcome-agnostic supervision for every segment.
Figure 3: Rubric effects differ between benchmarks and real-world data. Each point shows the change in success probability when a rubric feature is present ($\Delta$), with 95% confidence intervals; red indicates FDR significance ($q<0.05$, where $q$ is the FDR-adjusted $p$-value). Bottom (benchmarks). SWE-bench and SWE-Gym show strong, consistent negative effects for core agent-behavior failures---especially incomplete_implementation, insufficient_testing, and insufficient_debugging---indicating these behaviors reliably predict unit-test failure. Top (real-world). In contrast, effect sizes are smaller and less consistently significant under PR-merge and code-survival proxies, reflecting noisier supervision and multi-turn credit assignment; user follow-up features (e.g., correction or reversion requests) exhibit stronger associations with code survival than with PR merge. Overall, benchmarks highlight stable failure modes, while real-world data exposes proxy-dependent and interaction-dependent effects.
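The captions above describe a semi-supervised, multi-task critic: rubric features are available for every segment, while an outcome proxy (PR merge or code survival) is present only for some. One way to picture such an objective is a per-segment loss in which the outcome term is simply dropped when no label exists. The sketch below makes that assumption explicit; the binary cross-entropy form, the averaging over rubric features, and the `outcome_weight` parameter are illustrative choices, not the paper's stated formulation.

```python
import math
from typing import Optional, Sequence


def bce(prob: float, label: float) -> float:
    """Binary cross-entropy for one prediction; `label` may be soft (in [0, 1])."""
    eps = 1e-7
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def segment_loss(
    rubric_probs: Sequence[float],
    rubric_labels: Sequence[float],
    outcome_prob: float,
    outcome_label: Optional[float],
    outcome_weight: float = 1.0,  # hypothetical weighting between the two tasks
) -> float:
    """Rubric loss on every segment, plus an outcome loss only when a label exists."""
    if not rubric_probs or len(rubric_probs) != len(rubric_labels):
        raise ValueError("rubric predictions and labels must be non-empty and aligned")
    rubric_loss = sum(bce(p, y) for p, y in zip(rubric_probs, rubric_labels))
    rubric_loss /= len(rubric_probs)
    if outcome_label is None:
        # Unlabeled segment: only the dense rubric supervision contributes.
        return rubric_loss
    return rubric_loss + outcome_weight * bce(outcome_prob, outcome_label)
```

Under these assumptions a soft outcome label, such as the fraction of segment-authored code that survives in the final diff, can be passed directly as `outcome_label`, while a PR-merge label would be 0 or 1.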
