Improving AGI Evaluation: A Data Science Perspective
John Hawkins
TL;DR
AGI evaluation is challenging due to broad, ill-defined goals and a lack of perfect end-state metrics. The paper proposes a competence-centered evaluation framework inspired by data science, integrating agency-aware task design with out-of-time testing, group testing, and uncertainty quantification to ensure robustness and deployment relevance. It presents concrete protocols, including out-of-time benchmarking against tasks like the Bitcoin white paper and an admin-task simulator for uncertainty testing, alongside a taxonomy of agency levels to probe autonomous capability. This approach aims to curb benchmark gaming, improve generalization assessment, and accelerate trustworthy progress toward autonomous AGI in real-world settings.
Abstract
Evaluation of potential AGI systems and methods is difficult due to the breadth of the engineering goal. We have no methods for perfect evaluation of the end state, and instead measure performance on small tests designed to provide a directional indication that we are approaching AGI. In this work, we argue that AGI evaluation methods have been dominated by a design philosophy that uses our intuitions of what intelligence is to create synthetic tasks that have performed poorly in the history of AI. Instead, we argue for an alternative design philosophy focused on evaluating robust task execution that seeks to demonstrate AGI through competence. This perspective is developed from common practices in data science that are used to show that a system can be reliably deployed. We provide practical examples of what this would mean for AGI evaluation.
