Forecasting Frontier Language Model Agent Capabilities

Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn

TL;DR

This paper addresses forecasting the frontier capabilities of autonomous language model agents, arguing that frontier performance, rather than average model scores, drives practical risk and deployment. It compares six forecasting methods and foregrounds a two-step approach: a linear map from release date to an intermediate capability metric (PC-1 or Elo), followed by a sigmoidal map from that metric to benchmark scores. Release Date → PC-1 → Benchmark performs best in backtests. The study validates the approach on frontier models from the OpenLLM 2 leaderboard and then applies it to three agent-centric benchmarks (SWE-Bench Verified, Cybench, RE-Bench) using a simple scaffold and two elicitation regimes (low and high). The forecasts suggest that non-specialized LM agents with low elicitation may reach a 54% success rate on SWE-Bench Verified by early 2026, while the best-known frontier scaffolds could reach 87%. These forecasts exclude potential inference-time compute scaling and may therefore be conservative. The work provides a practical framework for anticipating frontier capabilities and highlights the importance of benchmark choice, data signals, and elicitation in forecasting agentic risk and capability maturation.

Abstract

As Language Models (LMs) increasingly operate as autonomous agents, accurately forecasting their capabilities becomes crucial for societal preparedness. We evaluate six forecasting methods that predict downstream capabilities of LM agents. We use either "one-step" approaches, which predict benchmark scores directly from input metrics like compute or model release date, or "two-step" approaches, which first predict an intermediate metric such as the principal component of cross-benchmark performance (PC-1) or human-evaluated competitive Elo ratings. We evaluate our forecasting methods by backtesting them on a dataset of 38 LMs from the OpenLLM 2 leaderboard. We then use the validated two-step approach (Release Date $\to$ Elo $\to$ Benchmark) to predict LM agent performance for frontier models on three benchmarks: SWE-Bench Verified (software development), Cybench (cybersecurity assessment), and RE-Bench (ML research engineering). Our forecast predicts that by the beginning of 2026, non-specialized LM agents with low capability elicitation will reach a success rate of 54% on SWE-Bench Verified, while state-of-the-art LM agents will reach an 87% success rate. Our approach does not account for recent advances in inference-compute scaling and might thus be too conservative.
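The two-step idea described in the abstract (a linear map from release date to Elo, then a sigmoid from Elo to benchmark score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sigmoid is fitted here as a straight line in logit space, which is a simplifying assumption, and all data values, function names, and the use of fractional years for release dates are hypothetical placeholders.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def logit(p):
    """Inverse of the sigmoid; p must lie strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_two_step(dates, elos, scores):
    # Step 1: linear map from release date (fractional year) to Elo.
    date_to_elo = fit_line(dates, elos)
    # Step 2: sigmoid map from Elo to benchmark score,
    # fitted as a line in logit space (assumption made for brevity).
    elo_to_logit = fit_line(elos, [logit(s) for s in scores])
    return date_to_elo, elo_to_logit

def forecast(model, date):
    (a, b), (c, d) = model
    predicted_elo = a * date + b
    return sigmoid(c * predicted_elo + d)

# Made-up placeholder data, NOT taken from the paper.
dates = [2023.0, 2023.5, 2024.0, 2024.5]
elos = [1150.0, 1200.0, 1250.0, 1300.0]
scores = [0.05, 0.12, 0.25, 0.45]

model = fit_two_step(dates, elos, scores)
for year in (2025.0, 2026.0):
    print(year, round(forecast(model, year), 3))
```

Fitting in logit space keeps the sketch dependency-free; a nonlinear least-squares fit of the sigmoid would be a reasonable alternative when scores sit close to 0 or 1.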

Paper Structure

This paper contains 38 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Low-Elicitation and High-Elicitation forecasts for LM agent performance on SWE-Bench, Cybench, and RE-Bench. Elicitation level refers to performance improvements from optimizing agent scaffolds, tools, and prompts to achieve better results. Forecasts are generated by predicting Chatbot Arena Elo-scores from release date and then benchmark score from Elo. The low-elicitation (blue) forecasts serve as a conservative estimate, as the agent has not been optimized and does not leverage additional inference compute. The high-elicitation (orange) forecasts use the highest publicly reported performance scores. Because RE-Bench has no public high-elicitation data, it is excluded from these forecasts.
  • Figure 2: Six approaches for predicting frontier LM capabilities. Two direct methods (blue pathways) model benchmark performance as a sigmoid function of either release date or compute (log-FLOP). Four two-step methods (red and purple pathways) first use a linear function to predict intermediate capability metrics (PC-1 or Chatbot Arena Elo) from input variables, then map these metrics to benchmark scores using a sigmoid function.
  • Figure 3: Forecasting intermediate capability metrics from input variables for frontier models. We find that both PC-1 and Elo are surprisingly linear when predicted from FLOP and release date, with all combinations having an $R^2 \ge 0.91$.
  • Figure 4: Visualization of backtesting forecasts for MMLU-PRO using the full method. We split the data into 4 parts with an equal number of models. We then fit a full path on split 1 and test on split 2, fit on splits 1 and 2 and predict on split 3, and so forth. Top: Comparing predicted to actual performance. Frontier models are marked with stars. Bottom: Average RMSE over frontier models. Bars are colored by the split they predict.
  • Figure 5: Predictions for a 0.9 success rate on SWE-Bench Verified and Cybench and a score of 1 on RE-Bench for low and high elicitation, respectively. We compute the distribution using bootstrapping with 10,000 samples. Note that the medians (50th percentile) of these histograms do not necessarily equal the forecasts made with all data points in Figure \ref{fig:scaling-graph}.
  • ...and 2 more figures
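
The expanding-window backtest described in the Figure 4 caption (fit on split 1 and test on split 2, fit on splits 1 and 2 and test on split 3, and so on) can be sketched as below. This is a hedged illustration only: a plain linear fit stands in for the paper's full forecasting path, the data are assumed to be ordered by release date, and the function names and sample values are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def expanding_window_backtest(xs, ys, n_splits=4):
    """Fit on splits 1..k, predict split k+1, for k = 1..n_splits-1.

    Assumes xs is sorted (e.g. by release date) and that each split
    contains at least two points so the first fit is well defined.
    """
    size = len(xs) // n_splits
    errors = []
    for k in range(1, n_splits):
        train_end = k * size
        # The last split absorbs any leftover points.
        test_end = (k + 1) * size if k < n_splits - 1 else len(xs)
        slope, intercept = fit_line(xs[:train_end], ys[:train_end])
        preds = [slope * x + intercept for x in xs[train_end:test_end]]
        errors.append(rmse(preds, ys[train_end:test_end]))
    return errors

# Made-up placeholder data, NOT taken from the paper.
release_dates = [2022.0, 2022.3, 2022.6, 2023.0, 2023.3, 2023.6, 2024.0, 2024.3]
metric = [1000.0, 1035.0, 1055.0, 1100.0, 1140.0, 1150.0, 1210.0, 1225.0]

for split, err in enumerate(expanding_window_backtest(release_dates, metric), start=2):
    print(f"RMSE when predicting split {split}: {err:.1f}")
```

Each error is computed only on data the fit has not seen, which is what makes the comparison between forecasting methods meaningful.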