What Makes an Evaluation Useful? Common Pitfalls and Best Practices
Gil Gekker, Meirav Segal, Dan Lahav, Omer Nevo
TL;DR
The paper addresses how to design useful safety evaluations for dangerous AI capabilities, with a focus on the decisions those evaluations are meant to inform and on how risk is framed. It presents a threat-model–driven framework that links decision processes, threat models, risk scenarios, and the capabilities derived from them to evaluation design, and introduces a calculable capability threshold, denoted $T$. Key contributions include a list of desirable properties of evaluations (realistic risk scenarios, exclusion from training sets, explicit difficulty, high signal density, coherent scoring) and guidelines for building evaluation suites that balance coverage and difficulty. Although it is illustrated with cybersecurity examples, the framework is intended to apply more widely, and the paper calls for empirical validation to establish practical community standards.
Abstract
Following the rapid increase in Artificial Intelligence (AI) capabilities in recent years, the AI community has voiced concerns regarding possible safety risks. To support decision-making on the safe use and development of AI systems, there is a growing need for high-quality evaluations of dangerous model capabilities. While several attempts to provide such evaluations have been made, a clear definition of what constitutes a "good evaluation" has yet to be agreed upon. In this practitioners' perspective paper, we present a set of best practices for safety evaluations, drawing on prior work in model evaluation and illustrated through cybersecurity examples. We first discuss the steps of the initial thought process, which connects threat modeling to evaluation design. Then, we provide the characteristics and parameters that make an evaluation useful. Finally, we address additional considerations as we move from building specific evaluations to building a full and comprehensive evaluation suite.
