A Shared Standard for Valid Measurement of Generative AI Systems' Capabilities, Risks, and Impacts

Alexandra Chouldechova, Chad Atalla, Solon Barocas, A. Feder Cooper, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Nicholas Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Matthew Vogel, Hannah Washington, Hanna Wallach

TL;DR

The paper addresses the challenge of validly measuring GenAI systems' capabilities, risks, and impacts across heterogeneous evaluation practices. It extends Adcock & Collier's measurement validity framework to GenAI by formalizing not only concepts but also amounts, populations, and instances, each through the sequential processes of systematization, operationalization, and application. The authors argue that this shared standard makes evaluations easier to compare and to interrogate for reliability and validity, because validity considerations are made explicit and earlier choices can be revised when later steps raise concerns. They illustrate the approach with a hypothetical example: measuring stereotyping in a conversational search engine, ChatSearch. Although the framework does not prescribe task formulation or interpretation, it aims to move GenAI evaluation toward a more formal, scientific discipline by highlighting where validity concerns arise and how to address them.

Abstract

The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems. We introduce a shared standard for valid measurement that helps place many of the disparate-seeming evaluation practices in use today on a common footing. Our framework, grounded in measurement theory from the social sciences, extends the work of Adcock & Collier (2001), in which the authors formalized valid measurement of concepts in political science via three processes: systematizing background concepts, operationalizing systematized concepts via annotation procedures, and applying those procedures to instances. We argue that valid measurement of GenAI systems' capabilities, risks, and impacts further requires systematizing, operationalizing, and applying not only the entailed concepts, but also the contexts of interest and the metrics used. This involves both descriptive reasoning about particular instances and inferential reasoning about underlying populations, which is the purview of statistics. By placing many disparate-seeming GenAI evaluation practices on a common footing, our framework enables individual evaluations to be better understood, interrogated for reliability and validity, and meaningfully compared. This is an important step in advancing GenAI evaluation practices toward more formalized and theoretically grounded processes, i.e., toward a science of GenAI evaluations.

Paper Structure

This paper contains 4 sections and 2 figures.

Figures (2)

  • Figure 1: Our proposed framework for measurement tasks of the form: measure the [amount] of a [concept] in [instances] from a [population]. The figure shows how the four elements that make up such tasks (amounts, concepts, instances, and populations) are formalized through the sequential processes of systematization, operationalization, and application. Elements in earlier levels (rows) can be revised and refined based on findings, including validity concerns, that arise in later levels.
  • Figure 2: A high-level example of using our framework in a hypothetical evaluation of a conversational search engine, ChatSearch. Each cell provides an overview of what a complete measurement procedure instantiated using the framework could look like. Note that a full instantiation would require providing considerable additional information. For example, a fully systematized complex concept or the full description of a complex sampling design might require several pages of exposition.
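
The task template in Figure 1 and the abstract's distinction between descriptive and inferential reasoning can be illustrated with a minimal Python sketch. Nothing here comes from the paper itself: the `MeasurementTask` class, the function names, the element values, and the toy annotations are all hypothetical, and the standard error assumes simple random sampling, which a real evaluation's sampling design may not satisfy.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class MeasurementTask:
    """Hypothetical container for the four elements of a measurement task."""
    amount: str
    concept: str
    instances: str
    population: str

    def describe(self) -> str:
        # Fills in the template: measure the [amount] of a [concept]
        # in [instances] from a [population].
        return (f"measure the {self.amount} of {self.concept} "
                f"in {self.instances} from {self.population}")


def sample_proportion(annotations: list[int]) -> float:
    """Descriptive step: share of sampled instances annotated 1 (concept present)."""
    return sum(annotations) / len(annotations)


def standard_error(annotations: list[int]) -> float:
    """Inferential step: standard error of the sample proportion.

    Assumes simple random sampling from the population (an assumption
    of this sketch, not a claim about any particular evaluation).
    """
    p = sample_proportion(annotations)
    return sqrt(p * (1 - p) / len(annotations))


# Hypothetical element values, loosely modeled on the ChatSearch example.
task = MeasurementTask(
    amount="prevalence",
    concept="stereotyping",
    instances="system responses",
    population="ChatSearch conversations",
)

# Toy binary annotations, invented purely for illustration.
annotations = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

print(task.describe())
print(f"sample proportion: {sample_proportion(annotations):.2f}")
print(f"standard error:    {standard_error(annotations):.3f}")
```

The sample proportion describes only the annotated instances; the standard error is what licenses a statement about the wider population, and it is only as trustworthy as the sampling assumption behind it.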