Toward an Evaluation Science for Generative AI Systems

Laura Weidinger, Inioluwa Deborah Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Deep Ganguli, Sanmi Koyejo, William Isaac

TL;DR

The paper argues that static benchmarks and ad hoc audits are insufficient to assess real-world performance and safety of generative AI systems. It proposes an evaluation science for generative AI, drawing on safety engineering and measurement science, and identifies three guiding properties: real-world applicability, iterative refinement of metrics, and institutional support. It details how to operationalize these properties through real-world metrics, multi-level measurement, post-deployment monitoring, and shared evaluation infrastructure, including documentation standards. The work aims to guide scalable, transparent evaluation that can improve safety, accountability, and public trust as generative AI becomes widely deployed.

Abstract

There is an increasing imperative to anticipate and understand the performance and safety of generative AI systems in real-world deployment contexts. However, the current evaluation ecosystem is insufficient: Commonly used static benchmarks face validity challenges, and ad hoc case-by-case audits rarely scale. In this piece, we advocate for maturing an evaluation science for generative AI systems. While generative AI creates unique challenges for system safety engineering and measurement science, the field can draw valuable insights from the development of safety evaluation practices in other fields, including transportation, aerospace, and pharmaceutical engineering. In particular, we present three key lessons: Evaluation metrics must be applicable to real-world performance, metrics must be iteratively refined, and evaluation institutions and norms must be established. Applying these insights, we outline a concrete path toward a more rigorous approach for evaluating generative AI systems.

Paper Structure

This paper contains 12 sections.