The AI Productivity Index (APEX)

Bertie Vidgen; Abby Fennelly; Evan Pinnix; Julien Benchek; Daniyal Khan; Zach Richards; Austin Bridges; Calix Huang; Kanishka Sahu; Abhishek Kottamasu; Bo Ma; Ben Hunsberger; Isaac Robinson; Akul Datta; Chirag Mahapatra; Dominic Barton; Cass R. Sunstein; Eric Topol; Brendan Foody; Osvald Nitski

TL;DR

APEX-v1-extended extends a real-world AI productivity benchmark across four professional domains, expanding the held-out set to 400 cases and adding a 100-case devset to support open research. The evaluation combines rubric-based prompts, curated sources, and eight-run LM judging to produce robust, CI-backed scores for 10 frontier models, with GPT-5 (Thinking = High) leading at 67.0%. Results reveal substantial gaps between frontier models and routine professional productivity, varying by task difficulty and domain, and show only modest consistency with the earlier devset. The work provides open data and tooling to accelerate research on AI-assisted professional work and highlights the need for further advances before frontier models can reliably replace or augment human professionals.

Abstract

We present an extended version of the AI Productivity Index (APEX-v1-extended), a benchmark for assessing whether frontier models can perform economically valuable tasks in four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD). This technical report details the extensions to APEX-v1, including an increase in the held-out evaluation set from n = 50 to n = 100 cases per job (n = 400 total) and updates to the grading methodology. We present a new leaderboard, where GPT-5 (Thinking = High) remains the top-performing model with a score of 67.0%. APEX-v1-extended shows that frontier models still have substantial limitations when performing typical professional tasks. To support further research, we are open-sourcing n = 25 non-benchmark example cases per role (n = 100 total) along with our evaluation harness.
