Fairness of ChatGPT

Yunqi Li; Lanjing Zhang; Yongfeng Zhang

TL;DR

This work assesses the fairness of ChatGPT in four high-stakes domains using four datasets (PISA, COMPAS, German Credit, Heart Disease), eight prompts (unbiased and biased), and both group and counterfactual individual fairness metrics. Disparities are quantified with $D_{SP}$, $D_{TPR}$, $D_{FPR}$, and change rates $CR_{Ovr}$ across settings, and compared against small baseline models. ChatGPT often matches or exceeds the small models in several domains and generally improves group fairness, but fairness gaps persist and the results are highly sensitive to prompt design. The study provides a public fairness benchmark and practical guidance for mitigating bias when deploying LLMs in real-world, high-stakes tasks.
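As a rough illustration of the metrics named above, the sketch below computes group gaps and a counterfactual change rate for binary predictions. It assumes the common textbook definitions: $D_{SP}$, $D_{TPR}$, and $D_{FPR}$ as absolute differences in positive-prediction rate, true positive rate, and false positive rate between two groups, and the change rate as the fraction of predictions that flip when only the sensitive attribute is altered. The paper's exact definitions may differ, and the function names `group_disparities` and `change_rate` are hypothetical.

```python
def _rate(flags):
    """Fraction of True values in an iterable; 0.0 if it is empty."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def group_disparities(y_true, y_pred, group):
    """Absolute gaps between group 0 and group 1.

    Assumes binary labels (0/1), binary predictions (0/1), and a binary
    sensitive attribute (0/1). These are the standard definitions, not
    necessarily the exact ones used in the paper.
    """
    stats = {}
    for g in (0, 1):
        rows = [(t, p) for t, p, s in zip(y_true, y_pred, group) if s == g]
        stats[g] = {
            "pos": _rate(p == 1 for _, p in rows),            # P(pred = 1)
            "tpr": _rate(p == 1 for t, p in rows if t == 1),  # P(pred = 1 | y = 1)
            "fpr": _rate(p == 1 for t, p in rows if t == 0),  # P(pred = 1 | y = 0)
        }
    return {
        "D_SP": abs(stats[0]["pos"] - stats[1]["pos"]),
        "D_TPR": abs(stats[0]["tpr"] - stats[1]["tpr"]),
        "D_FPR": abs(stats[0]["fpr"] - stats[1]["fpr"]),
    }


def change_rate(pred_original, pred_counterfactual):
    """Fraction of predictions that differ after flipping the sensitive attribute."""
    return _rate(a != b for a, b in zip(pred_original, pred_counterfactual))


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    y_true = [1, 0, 1, 0, 1, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
    group = [0, 0, 0, 0, 1, 1, 1, 1]
    print(group_disparities(y_true, y_pred, group))
    print(change_rate([1, 0, 1, 1], [1, 1, 1, 0]))
```

Smaller values indicate more similar treatment across groups (for the gaps) or more stable predictions under a counterfactual edit (for the change rate).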

Abstract

Understanding and addressing unfairness in LLMs is crucial for responsible AI deployment. However, there are few quantitative analyses and in-depth studies of fairness in LLMs, especially when LLMs are applied to high-stakes fields. This work aims to fill this gap by providing a systematic evaluation of the effectiveness and fairness of LLMs, using ChatGPT as a case study. We focus on assessing ChatGPT's performance in high-stakes fields including education, criminology, finance, and healthcare. To conduct a thorough evaluation, we consider both group fairness and individual fairness metrics. We also examine the disparities in ChatGPT's outputs under a set of biased and unbiased prompts. This work contributes to a deeper understanding of LLMs' fairness performance, facilitates bias mitigation, and fosters the development of responsible AI systems.

Paper Structure (9 sections, 5 equations, 1 figure, 4 tables)

This paper contains 9 sections, 5 equations, 1 figure, 4 tables.

Introduction
Related Work
Experimental Settings
Datasets
Prompts
Models
Evaluation Metrics
Experimental Results
Conclusions and Future Work

Figures (1)

Figure 1: An example of Prompt 1 on COMPAS
