Measuring What Matters: Connecting AI Ethics Evaluations to System Attributes, Hazards, and Harms

Shalaleh Rismani; Renee Shelby; Leah Davis; Negar Rostamzadeh; AJung Moon

TL;DR

This paper critiques the fragmented landscape of AI ethics measures by mapping nearly 800 measures to AI system components, attributes, hazards, and harms through a system-safety lens. Using a three-step methodology (scoping review, reflexive analysis, and dataset visualization), the authors find a strong bias toward four principles (fairness, transparency, privacy, trust) and toward model- and output-level assessments, with limited cross-component evaluation and unclear ties to lived harms. They identify five harm categories (representational, allocative, quality of service, interpersonal, social system) and provide a public dataset plus an interactive visualization to support more holistic, time-aware, and context-sensitive evaluation practices. The findings carry governance, industry, and research implications, underscoring the need for more systematic, multi-component measurement frameworks that better anticipate and mitigate sociotechnical harms in AI deployments.

Abstract

Over the past decade, an ecosystem of measures has emerged to evaluate the social and ethical implications of AI systems, largely shaped by high-level ethics principles. These measures are developed and used in fragmented ways, without adequate attention to how they are situated in AI systems. In this paper, we examine how existing measures used in the computing literature map to AI system components, attributes, hazards, and harms. Our analysis draws on a scoping review resulting in nearly 800 measures corresponding to 11 AI ethics principles. We find that most measures focus on four principles (fairness, transparency, privacy, and trust) and primarily assess model or output system components. Few measures account for interactions across system elements, and only a narrow set of hazards is typically considered for each harm type. Many measures are disconnected from where harm is experienced and lack guidance for setting meaningful thresholds. These patterns reveal how current evaluation practices remain fragmented, measuring in pieces rather than capturing how harms emerge across systems. Framing measures with respect to system attributes, hazards, and harms can strengthen regulatory oversight, support actionable practices in industry, and ground future research in systems-level understanding.
