Fairness of ChatGPT

Yunqi Li, Lanjing Zhang, Yongfeng Zhang

TL;DR

This work assesses the fairness of ChatGPT in four high-stakes domains, using a setup that includes four datasets (PISA, COMPAS, German Credit, Heart Disease), eight prompts (unbiased and biased), and both group and counterfactual individual fairness metrics. Disparities are quantified with $D_{SP}$, $D_{TPR}$, $D_{FPR}$, and change rates $CR_{Ovr}$ across settings, with comparisons to smaller baseline models. Key findings show that ChatGPT often matches or exceeds the small models in several domains and generally improves group fairness, but fairness gaps persist and the results are highly sensitive to prompt design. The study provides a public fairness benchmark and practical guidance for mitigating bias when deploying LLMs in real-world, high-stakes tasks.

Abstract

Understanding and addressing unfairness in LLMs are crucial for responsible AI deployment. However, there is a limited number of quantitative analyses and in-depth studies regarding fairness evaluations in LLMs, especially when applying LLMs to high-stakes fields. This work aims to fill this gap by providing a systematic evaluation of the effectiveness and fairness of LLMs using ChatGPT as a study case. We focus on assessing ChatGPT's performance in high-stakes fields including education, criminology, finance and healthcare. To conduct a thorough evaluation, we consider both group fairness and individual fairness metrics. We also observe the disparities in ChatGPT's outputs under a set of biased or unbiased prompts. This work contributes to a deeper understanding of LLMs' fairness performance, facilitates bias mitigation and fosters the development of responsible AI systems.
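
As an illustration of the group and individual fairness metrics mentioned above, the following is a minimal Python sketch. It assumes binary labels, binary predictions, and a binary sensitive attribute, and it uses common textbook formulations: absolute gaps in positive prediction rate, true positive rate, and false positive rate between two groups, plus the fraction of predictions that change when the sensitive attribute is flipped. The function names `group_disparities` and `change_rate` are hypothetical, and the paper's exact definitions of $D_{SP}$, $D_{TPR}$, $D_{FPR}$, and $CR_{Ovr}$ may differ.

```python
from typing import Dict, Sequence


def _rate(values: Sequence[int]) -> float:
    """Mean of a 0/1 sequence; returns 0.0 when empty (an assumption of this sketch)."""
    return sum(values) / len(values) if values else 0.0


def group_disparities(
    y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[int]
) -> Dict[str, float]:
    """Absolute gaps between group 0 and group 1 in positive rate, TPR, and FPR."""
    stats = {}
    for g in (0, 1):
        idx = [i for i, a in enumerate(group) if a == g]
        pos_rate = _rate([y_pred[i] for i in idx])
        tpr = _rate([y_pred[i] for i in idx if y_true[i] == 1])
        fpr = _rate([y_pred[i] for i in idx if y_true[i] == 0])
        stats[g] = (pos_rate, tpr, fpr)
    d_sp, d_tpr, d_fpr = (abs(a - b) for a, b in zip(stats[0], stats[1]))
    return {"D_SP": d_sp, "D_TPR": d_tpr, "D_FPR": d_fpr}


def change_rate(y_pred: Sequence[int], y_pred_flipped: Sequence[int]) -> float:
    """Fraction of individuals whose prediction changes after flipping the sensitive attribute."""
    return _rate([int(a != b) for a, b in zip(y_pred, y_pred_flipped)])
```

Under these assumptions, a value of 0 for each gap means the two groups are treated identically by that criterion, and a change rate of 0 means no individual prediction depends on the sensitive attribute in the counterfactual test.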

Paper Structure

This paper contains 9 sections, 5 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: An example of Prompt 1 on COMPAS