BRIDGE: Predicting Human Task Completion Time From Model Performance

Fengyuan Liu, Jay Gala, Nilaksh, Dzmitry Bahdanau, Siva Reddy, Hugo Larochelle

TL;DR

BRIDGE is a psychometric framework that aligns latent task difficulty, inferred from model performance, with human task completion time, enabling scalable human-centric evaluation without new human studies. It fits a 2PL IRT model to binary model–task outcomes and anchors the difficulty scale to METR’s human-time annotations, which lets it predict human task durations for new benchmarks and forecast frontier capabilities in human-interpretable units. Latent difficulty is approximately linear in the log of human completion time, and the resulting forecasts reproduce METR-like exponential growth in solvable task horizons, with the 50%-success horizon doubling approximately every 6 months. Bridging model-centric and human-centric metrics in this way offers a scalable, interpretable means to track AI progress across diverse benchmarks and over time.

Abstract

Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.
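As a rough illustration of the two quantities the abstract relies on, the sketch below evaluates the standard 2PL IRT success probability and extrapolates a 50% task horizon under a fixed doubling time. The symbol `theta` for model capability, the function names, and the starting horizon of 60 minutes are assumptions made for this example; only the discrimination `a`, the difficulty `b`, and the roughly 6-month doubling time come from the text.

```python
import math


def p_success(theta: float, a: float, b: float) -> float:
    """Standard 2PL IRT curve: probability that a model with capability
    theta (assumed notation) solves a task with discrimination a and
    difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def horizon_after(h0_minutes: float, months: float,
                  doubling_months: float = 6.0) -> float:
    """Extrapolate a 50% task horizon assuming pure exponential growth.
    h0_minutes is a hypothetical starting horizon, not a reported value."""
    return h0_minutes * 2.0 ** (months / doubling_months)


# When capability equals difficulty, the success probability is exactly 0.5,
# which is why the 50% horizon is read off at theta == b.
print(p_success(theta=1.0, a=2.0, b=1.0))  # 0.5

# A hypothetical 60-minute horizon doubles twice over 12 months.
print(horizon_after(60.0, 12.0))  # 240.0
```

A larger `a` makes the curve steeper around `b`, so the same capability gap translates into a sharper change in success probability.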

Paper Structure (31 sections, 3 equations, 12 figures)

Figures (12)

  • Figure 1: Overview of BRIDGE. Model responses across different benchmarks (clustered by colors) are used to fit a two-parameter logistic Item Response Theory (2PL IRT) model, estimating latent task difficulty and model capability. Calibrating latent difficulty against tasks with known human task completion times yields accurate predictions of human task duration for new benchmarks. We leverage this alignment to forecast frontier model capabilities in terms of human task length even in the absence of human task duration annotations.
  • Figure 2: Task length (human completion time) vs. latent task difficulty ($b$) estimated via 2PL IRT across METR task suites (SWAA, HCAST, RE-bench), based on \ref{eq:humans}. The log-linear fit ($R^2 = 0.81$) shows that each unit increase in $b$ corresponds to $\sim2.26\times$ longer human completion time. This calibration anchors the IRT latent difficulty scale to human-interpretable units to enable prediction of task duration directly from model performance.
  • Figure 3: Alignment between annotated human completion time buckets and estimated human completion times on SWE-bench Verified. We report per-bucket classification accuracy (Acc) and the number of tasks (n), as well as overall accuracy, weighted macro F1 score, and weighted kappa. We compare a logit success-rate heuristic, LLM-based time predictions (Gemini 3 Pro, GPT-5.2), and BRIDGE. BRIDGE achieves substantially better alignment with the annotated time buckets than both heuristic and LLM-based baselines.
  • Figure 4: Alignment between actual human completion time (first-solve time) and estimated completion times on Cybench. The logit success-rate baseline substantially underestimates task duration, while LLM-based estimates consistently overestimate it. In contrast, BRIDGE aligns closely with actual human times, with 92.3% of tasks falling within a $0.5\times$–$2\times$ tolerance band.
  • Figure 5: Success probability versus estimated human task completion time for different models, smoothed with a window of 15 tasks. Solvable task lengths at the 50% success threshold are indicated across model release dates, with darker blue denoting more recent models. SOTA models achieve 50% success on tasks estimated to require $\sim$1.4–2.5 hours of human effort. Steeper curves reflect higher task discrimination parameters $a$. Non-smoothness arises from heterogeneity in task-level difficulty and discrimination $(a_i, b_i)$, highlighting the importance of task-level granularity. Shaded regions indicate $\pm1$ standard error for each latent task difficulty, averaged across each window, and transformed to human task completion time via \ref{eq:humans}.
  • ...and 7 more figures
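
The calibration described in the Figure 2 caption and the tolerance band in the Figure 4 caption can be sketched as follows. Only the slope (each unit of $b$ multiplies human time by about 2.26) and the 0.5× to 2× band come from the captions; the intercept of the fit is not given here, so the anchor task (`b_anchor`, `minutes_anchor`) and both function names are hypothetical.

```python
FACTOR_PER_UNIT_B = 2.26  # multiplicative slope reported in the Figure 2 caption


def predict_human_minutes(b: float, b_anchor: float, minutes_anchor: float,
                          factor: float = FACTOR_PER_UNIT_B) -> float:
    """Log-linear calibration: human time grows by `factor` per unit of
    latent difficulty b, relative to a hypothetical anchor task whose
    difficulty and human time are both known."""
    return minutes_anchor * factor ** (b - b_anchor)


def within_band(predicted: float, actual: float,
                low: float = 0.5, high: float = 2.0) -> bool:
    """Check whether a prediction lies within the 0.5x to 2x tolerance band
    around the actual human time."""
    ratio = predicted / actual
    return low <= ratio <= high


# One unit harder than a hypothetical 10-minute anchor task.
estimate = predict_human_minutes(b=1.0, b_anchor=0.0, minutes_anchor=10.0)
print(round(estimate, 1))          # 22.6
print(within_band(estimate, 15.0))  # True: 22.6 / 15 is about 1.51
```

Expressing the fit relative to an anchor task avoids guessing the intercept: any task with a known human time and an estimated $b$ can play that role.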