A Scoping Study of Evaluation Practices for Responsible AI Tools: Steps Towards Effectiveness Evaluations

Glen Berman; Nitesh Goyal; Michael Madaio

TL;DR

RAI tools aim to shift AI development toward fairness, accountability, and transparency, but current evaluations emphasize usability over effectiveness. The study analyzes publicly available documentation from 37 publications describing 27 tools to identify patterns and gaps in evaluation practice. It argues that external validity and threats to validity are under-addressed, and draws lessons from education and medicine to propose an effectiveness evaluation framework. The authors outline design desiderata and field-level actions to enable more robust, multi-stakeholder evaluations of RAI tools, with potential policy and industry impact.

Abstract

Responsible design of AI systems is a shared goal across HCI and AI communities. Responsible AI (RAI) tools have been developed to support practitioners in identifying, assessing, and mitigating ethical issues during AI development. These tools take many forms (e.g., design playbooks, software toolkits, documentation protocols). However, research suggests that the use of RAI tools is shaped by organizational contexts, raising questions about how effective such tools are in practice. To better understand how RAI tools are, and might be, evaluated, we conducted a qualitative analysis of 37 publications that discuss evaluations of RAI tools. We find that most evaluations focus on usability, while questions of tools' effectiveness in changing AI development are sidelined. While usability evaluations are an important approach to evaluating RAI tools, we draw on evaluation approaches from other fields to highlight developer- and community-level steps to support evaluations of RAI tools' effectiveness in shaping AI development practices and outcomes.

Paper Structure (43 sections, 5 tables)

Introduction
Related work
Defining and developing RAI tools
Evaluating RAI tools
Tool evaluations in HCI
Evaluation goals and approaches outside of HCI
Evaluation validity
Methods
Developing the corpus of publications
Data analysis
Corpus description
Review of RAI tools in the corpus
Review of publications in the corpus
Positionality statement
RAI tool evaluation practices
...and 28 more sections
