Remote Labor Index: Measuring AI Automation of Remote Work

Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Sumana Basu, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernandez Cardona, Annette Diamond, Summer Yue, Alexandr Wang, Bing Liu, Ernesto Hernandez, Dan Hendrycks

TL;DR

RLI addresses the problem of measuring AI automation in remote labor by introducing a large-scale, end-to-end benchmark built from real Upwork projects. It uses a manual evaluation pipeline to compare AI deliverables against human gold standards across 240 projects in 23 domains, with metrics such as automation rate, Elo, dollars earned, and autoflation. Results show that current frontier AI agents perform near the floor, with the highest-performing agent reaching an automation rate of $2.5\%$, but Elo-based comparisons reveal measurable progress across models. This work provides an empirical foundation for tracking AI-driven labor automation and informs policymakers, researchers, and stakeholders about the practical limits and potential trajectories of AI in remote work.

Abstract

AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.

Paper Structure

This paper contains 75 sections, 19 figures, 4 tables.

Figures (19)

  • Figure 1: The Remote Labor Index (RLI) represents a broad range of projects from across the remote labor economy, including game development, product design, architecture, and data analysis. All projects represent real work that was performed by human professionals.
  • Figure 2: All AI agents tested automate at most $2.5\%$ of tasks on RLI, showing that most economically valuable remote work currently remains far beyond their capabilities.
  • Figure 3: RLI captures a wide array of project types, spanning $23$ categories of work from the Upwork taxonomy. Here, we show the top seven categories.
  • Figure 4: RLI spans a broad range of difficulty, with project costs reaching over $\$10,\!000$ and completion times for human professionals reaching over $100$ hours. All project costs and completion times come directly from human professionals who completed the projects. In total, the projects in RLI represent over $6,\!000$ hours of real work valued at over $\$140,\!000$.
  • Figure 5: RLI projects were extensively filtered and cleaned to ensure quality. Projects were sourced primarily from the remote labor market and secondarily from deliverables representing uncommon and emerging types of remote work. (For details, see Appendix \ref{app:dataset_details}.)
  • ...and 14 more figures