A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models

Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joelle Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor Horn; Karan Singhal

TL;DR

This work presents resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions, and conducts a large-scale empirical case study with the Med-PaLM 2 LLM.

Abstract

Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes, we hope that it can be leveraged and built upon towards a shared goal of LLMs that promote accessible and equitable healthcare.

Paper Structure (44 sections, 12 figures, 19 tables)

Introduction
Results
Assessment Design
EquityMedQA
Empirical Study
Independent and Pairwise Analyses
Counterfactual Analyses
Consumer Study
Inter-rater Reliability
Application to Omiye et al.
Discussion
Limitations and Future Work
Methods
Assessment Design Methodology
Participatory approach with equity experts
...and 29 more sections

Figures (12)

Figure 1: Overview of our main contributions. We employ an iterative, participatory approach to design human assessment rubrics for surfacing health equity harms and biases; introduce EquityMedQA, a collection of seven newly released adversarial medical question-answering datasets enriched for equity-related content that substantially expands upon the volume and breadth of previously studied adversarial data for medical question answering; and perform a large-scale empirical study of health equity-related biases in LLMs.
Figure 2: Results of independent evaluation of bias in Med-PaLM 2 answers. We report the rate at which raters reported minor or severe bias in Med-PaLM 2 answers for physician and health equity expert raters for each dataset and dimension of bias. The number of answers rated for each dataset is reported in the evaluation datasets summary table and the Methods section. Statistics for multiply-rated datasets (Mixed MMQA-OMAQ and Omiye et al.) are computed with pooling over replicates, with the level of replication indicated in parentheses. Data are reported as proportions with 95% confidence intervals.
Figure 3: Results of pairwise evaluation of Med-PaLM 2 answers compared to Med-PaLM and physician answers. We report the rates at which raters reported a lesser degree of bias in Med-PaLM 2 answers versus comparator answers across datasets, rater types, and dimensions of bias. The number of answers rated for each dataset is reported in the evaluation datasets summary table and the Methods section. The comparator is Med-PaLM in all cases except for physician-written answers to HealthSearchQA questions. Data are reported as proportions with 95% confidence intervals.
Figure 4: Results of counterfactual and independent evaluation on counterfactual datasets. In the top four rows, we report the rates at which raters reported bias in counterfactual pairs using the proposed counterfactual rubric as well as the rates at which they reported bias in one, one or more, or both of the answers using the independent evaluation rubric, for the CC-Manual (n=102 pairs, triple replication) and the CC-LLM datasets (n=200 pairs). For comparison, the bottom row reports independent evaluation results aggregated across all unpaired questions for the CC-Manual (n=42) and CC-LLM (n=100) datasets. For (A-B), data are reported as counts; for (C-D) data are reported as proportions with 95% confidence intervals.
Figure S1: Effect of aggregation method on the results of triple-rated independent evaluation of bias. We show rates at which raters reported answers as containing bias for the triple-rated Mixed MMQA-OMAQ dataset (n=240, triple replication) across rater types, dimensions of bias, and methods of aggregation over raters. “Majority” and “Any” refer to rates at which at least two and at least one of the three raters reported bias, respectively. The “Pooled” rate treats all ratings as independent. Data are reported as proportions with 95% confidence intervals.
...and 7 more figures
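
The three aggregation methods named in the Figure S1 caption ("Majority", "Any", and "Pooled") can be illustrated with a minimal sketch. This is not the authors' code: the function name `aggregate_bias_ratings`, the boolean encoding of ratings, and the example data are all assumptions made for illustration, and the sketch does not compute the confidence intervals, whose method is not specified in this section.

```python
from typing import Dict, Sequence


def aggregate_bias_ratings(ratings: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Aggregate replicated bias ratings (hypothetical helper).

    ratings[i] holds one boolean per rater for answer i
    (True means that rater reported bias in the answer).
    """
    if not ratings or any(len(r) == 0 for r in ratings):
        raise ValueError("every answer needs at least one rating")

    n_answers = len(ratings)
    # "Majority": fraction of answers where at least two raters reported bias.
    majority = sum(sum(r) >= 2 for r in ratings) / n_answers
    # "Any": fraction of answers where at least one rater reported bias.
    any_rate = sum(sum(r) >= 1 for r in ratings) / n_answers
    # "Pooled": every individual rating is treated as an independent observation.
    n_ratings = sum(len(r) for r in ratings)
    pooled = sum(sum(r) for r in ratings) / n_ratings
    return {"majority": majority, "any": any_rate, "pooled": pooled}


# Made-up example: four answers, each rated by three raters.
example = [
    [True, True, False],
    [False, False, True],
    [False, False, False],
    [True, True, True],
]
rates = aggregate_bias_ratings(example)
```

By construction the "Any" rate is never lower than the "Majority" rate for the same ratings, which is why the choice of aggregation method can change the apparent prevalence of bias.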
