Analyzing Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer
TL;DR
The paper analyzes two proposed Monte Carlo-based estimators for AI-agent task-solving probabilities: the Milestone method, which factorizes $P(T^S)$ into a product of conditional milestone-solve probabilities, and the Expert Best-of-N method, which uses expert-provided completions to infer $P(T^S)$ via a bit-cost reweighting. Both approaches reduce variance relative to naive end-to-end sampling but introduce bias: the Milestone method consistently underestimates true solve rates under realistic outcome-based grading, and the Expert Best-of-N method underestimates them even more severely because of a flawed reweighting factor. Through formal connections to subset simulation and importance sampling, plus extensive experiments on GAIA-like tasks, the work shows that current estimators can misrepresent rare capabilities, highlighting the need to draw on the broader Monte Carlo literature for more accurate capability estimation. The authors propose exploring weighted-ensemble and other rare-event sampling techniques as promising directions for obtaining more reliable estimates on hard, potentially safety-relevant tasks.
Abstract
To mitigate risks from AI systems, we need to assess their capabilities accurately. This is especially difficult in cases where capabilities are only rarely displayed. Phuong et al. propose two methods that aim to obtain better estimates of the probability of an AI agent successfully completing a given task. The milestone method decomposes tasks into subtasks, aiming to improve overall success rate estimation, while the expert best-of-N method leverages human guidance as a proxy for the model's independent performance. Our analysis of these methods as Monte Carlo estimators reveals that while both effectively reduce variance compared to naive Monte Carlo sampling, they also introduce bias. Experimental results demonstrate that the milestone method underestimates true solve rates for many real-world tasks due to its constraining assumptions. The expert best-of-N method exhibits even more severe underestimation across all tasks, which we attribute to an inherently flawed re-weighting factor. To enhance the accuracy of capability estimates of AI agents on difficult tasks, we suggest that future work leverage the rich literature on Monte Carlo estimators.
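To make the variance argument concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical task made of four independent stages, each solved with probability 0.1, so the true end-to-end solve rate is $10^{-4}$. The function names, stage probabilities, and sample budget are all illustrative assumptions. Because the toy task factorizes exactly, it illustrates only the variance reduction and does not reproduce the bias discussed above.

```python
import math
import random


def naive_estimate(stage_probs, n_samples, rng):
    """End-to-end Monte Carlo: a run counts only if every stage succeeds."""
    successes = 0
    for _ in range(n_samples):
        if all(rng.random() < p for p in stage_probs):
            successes += 1
    return successes / n_samples


def milestone_estimate(stage_probs, n_samples, rng):
    """Estimate each stage's solve rate separately, then multiply them.

    Assumption (toy setting only): each stage is sampled as if all earlier
    stages had already been solved, and stages are independent.
    """
    estimate = 1.0
    for p in stage_probs:
        solved = sum(rng.random() < p for _ in range(n_samples))
        estimate *= solved / n_samples
    return estimate


rng = random.Random(0)                # fixed seed for reproducibility
stage_probs = [0.1, 0.1, 0.1, 0.1]    # hypothetical per-stage solve rates
true_p = math.prod(stage_probs)       # 1e-4 in this toy example

naive = naive_estimate(stage_probs, 1000, rng)
milestone = milestone_estimate(stage_probs, 1000, rng)

print(f"true solve rate:    {true_p:.2e}")
print(f"naive estimate:     {naive:.2e}")
print(f"milestone estimate: {milestone:.2e}")
```

With 1000 samples, the naive estimator can only return multiples of $10^{-3}$, so it will usually report exactly zero for a solve rate of $10^{-4}$, whereas the product of per-stage estimates can resolve much smaller probabilities with the same per-stage budget.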
