\$OneMillion-Bench: How Far are Language Agents from Human Experts?

Qianyu Yang; Yang Liu; Jiaqi Li; Jun Bai; Hao Chen; Kaiyuan Chen; Tiliang Duan; Jiayun Dong; Xiaobo Hu; Zixia Jia; Yang Liu; Tao Peng; Yixin Ren; Ran Tian; Zaiyuan Wang; Yanglihong Xiao; Gang Yao; Lingyue Yin; Ge Zhang; Chun Zhang; Jianpeng Jiao; Zilong Zheng; Yuan Gong

TL;DR

This paper introduces \$OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science. It evaluates agents in economically consequential scenarios and focuses on expert-level problems so that agents are meaningfully differentiated.

Abstract

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To address this gap, we introduce \$OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making decisions under constraints, where correctness depends as much on the reasoning process as on the final answer. We adopt a rubric-based evaluation protocol that scores factual accuracy, logical coherence, practical feasibility, and professional compliance, and we focus on expert-level problems to ensure meaningful differentiation across agents. Together, these design choices make \$OneMillion-Bench a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.
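
To make the rubric-based protocol concrete, the following is a minimal sketch of how a weighted rubric could be turned into a single score. It is an illustration under stated assumptions, not the paper's scoring formula: the `RubricItem` class, the `rubric_score` function, the binary `satisfied` judgment, and the example weights are all hypothetical. Only the four criterion names come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One rubric line: a criterion, its weight, and a binary judgment.

    Hypothetical structure; the benchmark's real rubric schema may differ.
    """
    criterion: str
    weight: float
    satisfied: bool


def rubric_score(items: list[RubricItem]) -> float:
    """Return the fraction of total rubric weight earned by a response."""
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(item.weight for item in items if item.satisfied)
    return earned / total


# Example weights are invented for illustration only.
example = [
    RubricItem("factual accuracy", 4.0, True),
    RubricItem("logical coherence", 3.0, True),
    RubricItem("practical feasibility", 2.0, False),
    RubricItem("professional compliance", 1.0, True),
]
score = rubric_score(example)  # 8 of 10 weight units earned
```

Normalizing by total weight keeps scores comparable across tasks whose rubrics differ in length, which is one plausible reason to weight rubric items at all.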

Paper Structure (73 sections, 4 equations, 10 figures, 8 tables)


Introduction
How does $OneMillion-Bench Measure?
Economic Value
Expert Cost.
Wage Anchoring.
Expertise Measurement
Expert Score.
Pass Rate.
Score Aggregation Strategy.
Constructing the $OneMillion-Bench
Data Curation Pipeline
Stage 1: Task Creation.
Stage 2: Peer Review.
Stage 3: Resolution and Revision.
Data Overview
...and 58 more sections

Figures (10)

Figure 1: Leaderboard performance on $OneMillion-Bench.
Figure 2: Data curation pipeline of $OneMillion-Bench. The process involves domain experts designing specialized tasks with scoring rubrics, which are then peer-reviewed, validated against state-of-the-art agents to ensure discriminative power, and further refined through consensus.
Figure 3: $OneMillion-Bench consists of 5 macro domains, 37 sub-domains, and 92 third-level categories covering a wide variety of real applications and professional scenarios.
Figure 4: Sample data of different domains with varying scores of rubric weights and tags.
Figure 5: Comparison of web search scaffolds.
...and 5 more figures
