Measuring Agents in Production

Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Joseph E. Gonzalez, Koushik Sen, Dawn Song, Ion Stoica, Matei Zaharia, Marquita Ellis

TL;DR

MAP presents the first large-scale empirical study of AI agents in production, combining a 306-response online survey with 20 in-depth case studies across 26 domains. The study reveals that production agents rely on simple, controllable methods with heavy human-in-the-loop evaluation, and that reliability is the dominant deployment hurdle despite broad real-world impact. It documents practical patterns—such as bounded autonomy, prompting-centric design, and selective post-training—that enable effective deployments, while highlighting gaps in benchmarks and observability. These insights bridge research and industry practice, informing both tool builders and practitioners about current constraints, patterns, and opportunities for expanding production-grade agent systems.

Abstract

AI agents are actively running in production across diverse industries, yet little is publicly known about which technical approaches enable successful real-world deployments. We present the first large-scale systematic study of AI agents in production, surveying 306 practitioners and conducting 20 in-depth case studies via interviews across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and what the top development challenges are. We find that production agents are typically built using simple, controllable approaches: 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness. Despite these challenges, simple yet effective methods already enable agents to deliver impact across diverse industries. Our study documents the current state of practice and bridges the gap between research and deployment by providing researchers visibility into production challenges while offering practitioners proven patterns from successful deployments.

Paper Structure

This paper contains 53 sections, 26 figures, 4 tables.

Figures (26)

  • Figure 1: Reasons practitioners build and deploy AI agents ($N{=}66$). Increasing the speed of task completion relative to the previous non-agentic (human or non-LLM automation) system (Increasing Productivity) is selected most often (73%), while improving operational stability (mitigating risk and accelerating failure recovery) is selected least often. Exact question and option descriptions are given in Appendix \ref{app:questions_core} (N7). $N{=}66$ reflects the filtered subset of survey participants working on deployed agents who answered this question. Error bars indicate 95th-percentile intervals estimated from 1,000 bootstrap samples drawn with replacement. The question was multi-select: participants were asked to select all benefits that applied to them, so the proportions do not sum to 1.
  • Figure 2: Application domains where practitioners build and deploy Agentic AI systems ($N{=}69$). Deployments span a broad range of industries, with the highest concentrations in finance, technology (including but not limited to software development), and corporate services. Additional lower-frequency domains ("other") are listed in Table \ref{tab:single_occurrence_domains}, for a total of roughly 26 distinct domains. Note that this is a multi-select question, so each system may be assigned to multiple domain categories.
  • Figure 3: The distribution of source institution maturities across the in-depth, interview-based case studies. A minority (5/20) are from seed-stage startups (validating product-market fit), early-stage startups (proving scalable business models), and growth-stage startups (rapidly expanding market share and operations). The majority (15/20) are from late-stage and mature institutions (with established market positions). The stages are approximated from limited public information, e.g., size, sector, and annual recurring revenue.
  • Figure 4: Overview of survey respondent and system characteristics across all agents in the survey data: (a) roles of survey participants by primary contribution area ($N{=}108$), (b) deployment stages of the Agentic AI systems ($N{=}111$) that survey participants contributed to, and (c) reported number of end users ($N{=}35$) for the Agentic AI systems that survey participants contributed to.
  • Figure 5: Overview of Agentic AI deployment characteristics in terms of primary end users and latency requirements: (a) Distribution of primary end users ($N{=}67$), where hatched bars (///) denote human end-users and solid bars denote non-human end-users; and (b) Reported tolerable end-to-end response latency for deployed systems ($N{=}53$). Over 92% of deployed agents primarily serve human users, and most systems tolerate response times on the order of minutes rather than requiring strict real-time responsiveness.
  • ...and 21 more figures
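
The Figure 1 caption describes error bars estimated from 1,000 bootstrap samples drawn with replacement. A minimal sketch of that kind of percentile bootstrap for one multi-select option is shown below. It is an illustration, not the authors' analysis code: the function name `bootstrap_percentile_interval`, the fixed seed, and the response vector (48 of 66 respondents selecting the option, chosen only because it rounds to the reported 73%) are all assumptions.

```python
import random


def bootstrap_percentile_interval(selected, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the share of respondents who chose an option.

    `selected` holds one 0/1 flag per respondent (1 = option selected).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(selected)
    proportions = []
    for _ in range(n_resamples):
        # Resample respondents with replacement, keeping the sample size fixed.
        resample = rng.choices(selected, k=n)
        proportions.append(sum(resample) / n)
    proportions.sort()
    # Read the interval endpoints off the sorted bootstrap distribution.
    alpha = (1.0 - level) / 2.0
    lower = proportions[int(alpha * n_resamples)]
    upper = proportions[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Hypothetical data: 48 of 66 respondents selected the option (about 73%).
responses = [1] * 48 + [0] * 18
point = sum(responses) / len(responses)
low, high = bootstrap_percentile_interval(responses)
print(f"share={point:.3f}, interval=[{low:.3f}, {high:.3f}]")
```

Because the question was multi-select, each option would be resampled as its own 0/1 column over the same respondents, which is why the per-option proportions need not sum to 1.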