
ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents

Smriti Jha, Matteo Paltenghi, Chandra Maddila, Vijayaraghavan Murali, Shubham Ugare, Satish Chandra

Abstract

Benchmarks that reflect production workloads are better suited to evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style, and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench, a benchmark sourced from real developer-agent sessions. We detail our data collection and curation practices, including LLM-based task classification, test relevance validation, and multi-run stability checks, which address the challenges of constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change, and fail-to-pass tests, spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates ranging from 53.2% to 72.2%. We demonstrate how these offline evaluation signals drive practical decisions around model selection and harness design, while noting that offline benchmarks provide a directional signal that we complement with online A/B testing for production deployment decisions. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
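
To make the sample format concrete, the sketch below shows one plausible shape for a curated sample and the fail-to-pass property it must satisfy. The field names and the test-runner interface are our own assumptions for illustration; they are not the benchmark's published schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class BenchmarkSample:
    """One curated task. Field names here are hypothetical, chosen for illustration."""
    prompt: str                        # verbatim developer prompt from the agent session
    language: str                      # one of the seven supported programming languages
    base_revision: str                 # monorepo revision the task starts from
    gold_patch: str                    # the committed code change, as a unified diff
    fail_to_pass_tests: List[str] = field(default_factory=list)  # tests that gate the task

def is_fail_to_pass(sample: BenchmarkSample,
                    run_tests: Callable[[str, List[str]], bool],
                    patched_revision: str) -> bool:
    """Fail-to-pass property: the selected tests fail at the base revision and
    pass once the committed change has been applied (patched_revision)."""
    fails_before = not run_tests(sample.base_revision, sample.fail_to_pass_tests)
    passes_after = run_tests(patched_revision, sample.fail_to_pass_tests)
    return fails_before and passes_after
```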

Figures (4)

  • Figure 1: ProdCodeBench verification pipeline. (Left) Starting from real developer-agent conversations, we apply prompt quality filters and test quality filters to obtain high-quality benchmark tasks with reliable fail-to-pass evaluation signals. (Right) Data funnel showing how the initial pool of conversations is progressively filtered to yield high-quality benchmark samples.
  • Figure 2: Tasks per category in ProdCodeBench, assigned by an LLM-powered task classifier based on prompt content.
  • Figure 3: Frontier model evaluations on ProdCodeBench. (Left) Model solve rates with 95% confidence intervals (one standard way to compute such intervals is sketched after this list). (Right) Tool usage distribution across the same models, showing that Claude models have higher validation and testing rates than GPT Codex. All models run on the same harness.
  • Figure 4: Harness evaluation comparison on ProdCodeBench. Agent-Basic + Opus 4.6 significantly outperforms Agent-Basic + GPT 5.1 Codex: despite Agent-Basic's search tool limitations, Opus 4.6 navigates the codebase better, whereas GPT 5.1 Codex often exhausts the maximum iteration budget at evaluation time due to prolonged search operations (see Table \ref{tab:tool_usage}). Agent-IDE outperforms Agent-Basic with both Codex and Opus models: its better code navigation leads to fewer searches and higher solve rates (see Table \ref{tab:tool_usage} for a tool usage comparison).
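
Figure 3 reports solve rates with 95% confidence intervals. The front matter does not say how those intervals are computed; one standard choice for a binomial proportion such as a solve rate is the Wilson score interval, sketched below with hypothetical task counts.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion; z = 1.96 gives ~95% coverage."""
    if trials == 0:
        return (0.0, 0.0)
    p_hat = successes / trials
    denom = 1.0 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
    return (center - margin, center + margin)

# Hypothetical example: 361 of 500 tasks solved (72.2%); 500 is illustrative, not the benchmark's size.
low, high = wilson_interval(successes=361, trials=500)
print(f"solve rate {361 / 500:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```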