Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation

Fatima Kazi

TL;DR

The work addresses how Large Language Models inherit social biases from training data and proposes a bias-evaluation framework using StereoSet and CrowSPairs. It combines implicit and explicit bias prompting with data augmentation and cross-dataset testing across BERT, DistilBERT, GPT-3.5, and T5 to quantify and mitigate stereotypes. Key findings show that fine-tuning with augmented data can reduce implicit bias and improve cross-dataset performance, but gender bias remains challenging and benchmark limitations persist. The study introduces a Bias-Identification Framework to standardize bias assessment and highlights avenues for future research, including multimodal biases and more robust debiasing strategies.

Abstract

Large Language Models (LLMs), such as ChatGPT, have gained popularity in recent years with advances in Natural Language Processing (NLP), with use cases spanning many disciplines as well as daily life. LLMs inherit explicit and implicit biases from the datasets they were trained on; these biases can include social, ethical, cultural, religious, and other prejudices and stereotypes. It is important to examine such shortcomings comprehensively by identifying the existence and extent of these biases, recognizing their origin, and attempting to mitigate biased outputs, so that outputs are fair and harmful stereotypes and misinformation are reduced. This study inspects and highlights the need to address biases in LLMs amid the growth of generative Artificial Intelligence (AI). We utilize bias-specific benchmarks such as StereoSet and CrowSPairs to evaluate the existence of various biases in many different generative models such as BERT, GPT 3.5, and ADA. To detect both explicit and implicit biases, we adopt a three-pronged approach for thorough and inclusive analysis. Results indicate that fine-tuned models struggle with gender biases but excel at identifying and avoiding racial biases. Our findings also illustrate that, despite some cases of success, LLMs often over-rely on keywords in prompts and in their outputs. This demonstrates that LLMs are unable to truly assess the accuracy and authenticity of their outputs. Finally, in an attempt to bolster model performance, we applied an enhancement learning strategy involving fine-tuning models using different prompting techniques, and data augmentation of the bias benchmarks. We found that fine-tuned models exhibit promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.

Paper Structure

This paper contains 34 sections, 5 figures, 18 tables.

Figures (5)

  • Figure 4.1: The framework outlines a process that begins with filtering the StereoSet and CrowS-Pairs datasets to create Multiple Choice Symbol Binding Questions (MCSBQ), which are then split into training and testing data. The training data undergoes augmentation using the T5 model, and both the original and augmented training data are used to fine-tune various models such as DistilBERT, BERT, and GPT-3.5. The fine-tuned models are evaluated on the testing data, and the baseline models are evaluated on the MCSBQ data. The results are analyzed using a three-pronged approach: quantitative (graphical representations), comparative (tabular formats), and qualitative (bags of words) techniques to uncover potential biases.
  • Figure 4.2: Distribution of prompts for each type of bias in StereoSet
  • Figure 4.3: Distribution of prompts for each type of bias in CrowSPairs
  • Figure 4.4: Distribution of targets associated with the religion stereotype in StereoSet
  • Figure 4.5: Distribution of targets associated with the gender stereotype in StereoSet
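
The first steps in the Figure 4.1 caption (turning benchmark items into multiple-choice questions with symbol-bound options, then splitting them into training and testing data) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `MCQ`, `build_mcq`, and `split_train_test`, the A/B/C prompt layout, the three option labels, the placeholder option texts, and the 20% test fraction are all hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass

# Assumed symbol set; the paper's actual option symbols are not given here.
SYMBOLS = ("A", "B", "C")


@dataclass
class MCQ:
    prompt: str    # question text shown to the model
    binding: dict  # maps each symbol to the label of the option it holds


def build_mcq(context, options, rng):
    """Bind shuffled options to symbols so option position carries no signal.

    `options` maps a label (e.g. "stereotype") to its candidate sentence.
    """
    items = list(options.items())
    rng.shuffle(items)
    lines = [context]
    binding = {}
    for symbol, (label, text) in zip(SYMBOLS, items):
        lines.append(f"{symbol}. {text}")
        binding[symbol] = label
    lines.append("Answer:")
    return MCQ(prompt="\n".join(lines), binding=binding)


def split_train_test(questions, test_fraction, rng):
    """Shuffle a copy of the questions and return (train, test)."""
    shuffled = list(questions)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for a reproducible split
    question = build_mcq(
        "Which sentence best continues the context?",
        {
            "stereotype": "<stereotypical option>",
            "anti-stereotype": "<anti-stereotypical option>",
            "unrelated": "<unrelated option>",
        },
        rng,
    )
    print(question.prompt)
    train, test = split_train_test([question] * 10, 0.2, rng)
    print(len(train), len(test))
```

Keeping the symbol-to-label binding next to each prompt lets a later evaluation step map a model's answer symbol back to the kind of option it chose, whichever order the options were shown in.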