What Makes an Evaluation Useful? Common Pitfalls and Best Practices

Gil Gekker, Meirav Segal, Dan Lahav, Omer Nevo

TL;DR

The paper addresses how to design useful safety evaluations for dangerous AI capabilities, focusing on the decisions those evaluations inform and on how risk is framed. It presents a threat-model–driven framework that links decision processes, threat models, risk scenarios, and derived capabilities to evaluation design, and introduces calculable capability thresholds, denoted $T$. Key contributions include a list of desirable properties of evaluations (realistic risk scenarios, training-set exclusion, explicit difficulty, high signal density, coherent scoring) and guidelines for building evaluation suites that balance coverage and difficulty. Although the framework is illustrated with cybersecurity examples, it aims to be widely applicable, and the authors call for empirical validation to establish practical community standards.

Abstract

Following the rapid increase in Artificial Intelligence (AI) capabilities in recent years, the AI community has voiced concerns regarding possible safety risks. To support decision-making on the safe use and development of AI systems, there is a growing need for high-quality evaluations of dangerous model capabilities. While several attempts to provide such evaluations have been made, a clear definition of what constitutes a "good evaluation" has yet to be agreed upon. In this practitioners' perspective paper, we present a set of best practices for safety evaluations, drawing on prior work in model evaluation and illustrated through cybersecurity examples. We first discuss the steps of the initial thought process, which connects threat modeling to evaluation design. Then, we provide the characteristics and parameters that make an evaluation useful. Finally, we address additional considerations as we move from building specific evaluations to building a full and comprehensive evaluation suite.

Paper Structure

This paper contains 37 sections and 2 figures.

Figures (2)

  • Figure 1: Suggested pre-design steps for safety evaluation development.
  • Figure 2: Visualization of the desired properties for an evaluation suite. The critical capability we aim to evaluate is represented by the circle. Each gray rectangle represents an evaluation, and the size of the rectangle indicates its difficulty level (larger means more difficult). A useful evaluation suite is composed of evaluations that together cover the space of the critical capability. A useful suite often contains evaluations of different difficulties and includes some coverage overlap between evaluations. The scoring method used to aggregate individual evaluation scores is selected to best serve the decision-making process.
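
The suite properties described in the Figure 2 caption (coverage of the critical capability, varied difficulty, some overlap, and an aggregate score serving the decision) can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the authors' method: the `Evaluation` fields, the sub-skill names, the difficulty-weighted aggregation, and the threshold value `T = 0.6` are all hypothetical, chosen only to show how such checks might be wired together.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """One evaluation in a suite (hypothetical structure, not from the paper)."""
    name: str
    difficulty: int          # larger means more difficult, as in Figure 2
    covers: frozenset[str]   # assumed sub-skills of the critical capability
    passed: bool             # whether the model solved this evaluation


def coverage_gaps(suite: list[Evaluation], capability: set[str]) -> set[str]:
    """Sub-skills of the critical capability that no evaluation touches."""
    covered = set().union(*(e.covers for e in suite))
    return capability - covered


def overlapping_skills(suite: list[Evaluation]) -> set[str]:
    """Sub-skills covered by more than one evaluation."""
    counts = Counter(skill for e in suite for skill in e.covers)
    return {skill for skill, n in counts.items() if n > 1}


def difficulty_weighted_score(suite: list[Evaluation]) -> float:
    """Assumed aggregation: share of total difficulty that was solved."""
    total = sum(e.difficulty for e in suite)
    solved = sum(e.difficulty for e in suite if e.passed)
    return solved / total if total else 0.0


# Hypothetical cybersecurity-flavoured capability and suite.
capability = {"recon", "exploit", "lateral_movement"}
suite = [
    Evaluation("port-scan", 1, frozenset({"recon"}), True),
    Evaluation("web-exploit", 2, frozenset({"recon", "exploit"}), True),
    Evaluation("pivot", 3, frozenset({"exploit", "lateral_movement"}), False),
]

T = 0.6  # assumed capability threshold; a real value comes from the threat model

gaps = coverage_gaps(suite, capability)
overlap = overlapping_skills(suite)
score = difficulty_weighted_score(suite)
exceeds_threshold = score >= T
```

Here the three evaluations leave no sub-skill uncovered, two sub-skills are covered twice, and the aggregate score falls below the assumed threshold. A different aggregation (for example, the hardest evaluation solved) may serve a given decision process better; the weighting above is only one option.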