Towards detecting unanticipated bias in Large Language Models

Anna Kruspe

TL;DR

This paper tackles unanticipated biases in LLMs and proposes a post-hoc, model-agnostic framework that uses Uncertainty Quantification (UQ) and Explainable AI (XAI) to surface subtle biases during inference. It reviews current bias research, defines key concepts, and outlines metrics and blind spots, then provides technical background on UQ and XAI. It goes on to propose practical methods for unanticipated bias detection (UBD), including Test-time Data Augmentation (TTDA), ensembles, verbal uncertainty, perturbation-based XAI, surrogate models, and prompting, along with evaluation strategies and mitigation ideas. It places strong emphasis on local explanations, visualization, and user feedback so that practitioners can detect and mitigate biases even when they have limited access to model internals.

Abstract

Over the last year, Large Language Models (LLMs) like ChatGPT have become widely available and have exhibited fairness issues similar to those in previous machine learning systems. Current research is primarily focused on analyzing and quantifying these biases in training data and their impact on the decisions of these models, alongside developing mitigation strategies. This research largely targets well-known biases related to gender, race, ethnicity, and language. However, it is clear that LLMs are also affected by other, less obvious implicit biases. The complex and often opaque nature of these models makes detecting such biases challenging, yet this is crucial due to their potential negative impact in various applications. In this paper, we explore new avenues for detecting these unanticipated biases in LLMs, focusing specifically on Uncertainty Quantification and Explainable AI methods. These approaches aim to assess the certainty of model decisions and to make the internal decision-making processes of LLMs more transparent, thereby identifying and understanding biases that are not immediately apparent. Through this research, we aim to contribute to the development of fairer and more transparent AI systems.

Paper Structure (29 sections, 3 figures)

Introduction
Current state of bias and fairness research
Bias sources and definitions
Metrics and evaluation
Bias mitigation
Blind spots
Technical background
Uncertainty Quantification (UQ)
Single Deterministic Networks
Bayesian methods
Ensemble methods
Test-time Data Augmentation
Calibration
Explainable AI (XAI)
Fine-tuning paradigm: Local explanations
...and 14 more sections

Figures (3)

Figure 1: ChatGPT reply for the prompt "Write a story about two friends, one tall and one short, and their careers" and visualizations. The example illustrates a bias that is not often considered, and would not be apparent when prompting directly for stereotypical characteristics of people of different heights. It is easy to see how this implicit bias in GPT-4 could lead to unfair decisions, e.g. in recruiting applications.
Figure 2: A mockup of a potential uncertainty result: Offering multiple response alternatives, in this case for the task of translating into a language with gendered inflections. [Icons: flaticon.com]
Figure 3: A mockup of a potential explainability result: Demonstrating influence factors in the input. Medical example from nguyen. [Icons: flaticon.com]
