Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis

Chahat Raj, Anjishnu Mukherjee, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu

TL;DR

This work probes social biases in large language models through the lens of the Contact Hypothesis, introducing a principled prompting framework to simulate intergroup contact and measure bias across 13 dimensions in 5 societal scenarios. It pairs this probing with Social Contact Debiasing (SCD), an instruction-tuning approach that trains models on unbiased responses generated via contact-framed prompts, yielding substantial bias reductions (up to ~40% in 1 epoch for LLaMA 2). The study demonstrates that positive contact framing tends to reduce bias and that the debiasing generalizes across prompt scales, datasets, and even a downstream BBQ benchmark without harming fluency or relevance. The results highlight the potential of psychologically informed prompt engineering and fine-tuning to mitigate biases in LLMs, while acknowledging limitations related to principle interdependence, prompt scales, and language scope.

Abstract

Large Language Models (LLMs) perpetuate social biases, reflecting prejudices in their training data and reinforcing societal stereotypes and inequalities. Our work explores the potential of the Contact Hypothesis, a concept from social psychology, for debiasing LLMs. We simulate various forms of social contact through LLM prompting to measure their influence on the model's biases, mirroring how intergroup interactions can reduce prejudices in social contexts. We create a dataset of 108,000 prompts following a principled approach replicating social contact to measure biases in three LLMs (LLaMA 2, Tulu, and NousHermes) across 13 social bias dimensions. We propose a unique debiasing technique, Social Contact Debiasing (SCD), that instruction-tunes these models with unbiased responses to prompts. Our research demonstrates that LLM responses exhibit social biases when subjected to contact probing, but more importantly, these biases can be significantly reduced, by up to 40% in 1 epoch of instruction tuning LLaMA 2 following our SCD strategy. Our code and data are available at https://github.com/chahatraj/breakingbias.

Paper Structure (44 sections, 8 figures, 7 tables)

Figures (8)

  • Figure 1: We evaluate LLM responses to contact probing for social biases along several dimensions and verify if these responses align with the Contact Hypothesis.
  • Figure 2: An example of a certainty-type prompt for positive contact with positive action in an education scenario. It uses a particular descriptor ("deaf") from the Ability dimension to test whether the Contact Hypothesis is followed for the key principle of equal group status.
  • Figure 3: Percentages of prompts to which LLaMA2-Chat (13B) generates a biased response across 13 dimensions of bias and 5 contact scenarios. Takeaway: Across scenarios, "Sports" shows the highest percentages of biased responses, particularly for the dimensions of "Religion", "Body type" and "Age". Across all scenarios, the dimension of "Political Ideologies" consistently shows a high percentage of biased responses.
  • Figure 4: Lighter shaded bars show the percentage of prompts that generate biased responses before instruction tuning, whereas darker shaded bars correspond to the same after instruction tuning. Darker bars begin at the top of the lighter bars (for example, for the first bar, the lighter bar is 41.64%, and the darker bar is only 2.4%). Takeaway: Instruction tuning on the prompt dataset reduces biases across all experimental settings.
  • Figure 5: Lighter shaded bars show the percentage of prompts that generate biased responses before instruction tuning, whereas darker shaded bars correspond to the same after instruction tuning. Darker bars begin at the top of the lighter bars (for example, for the last bar, the lighter bar is 46.73%, and the darker bar is 0.11%, which is negligible in the image). Takeaway: Instruction-tuning reduces biases to nearly zero (visualized by the absence of dark bars) across community and healthcare when tuned on education and workplace scenario prompts.
  • ...and 3 more figures
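
The before/after percentages quoted in the Figure 4 and Figure 5 captions can be read either as an absolute drop in percentage points or as a relative drop against the pre-tuning rate. The following minimal sketch computes both for the two examples the captions give. The function name `bias_reduction` and the dictionary labels are illustrative assumptions and are not taken from the authors' code; the summary above does not state which of the two readings the "up to 40%" figure uses.

```python
def bias_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration; not part of the paper's released code.
    """
    if before_pct <= 0:
        raise ValueError("before_pct must be positive to compute a relative drop")
    absolute = before_pct - after_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


# Percentages of biased responses (before, after instruction tuning),
# copied from the Figure 4 and Figure 5 captions.
caption_examples = {
    "Figure 4, first bar": (41.64, 2.4),
    "Figure 5, last bar": (46.73, 0.11),
}

for label, (before, after) in caption_examples.items():
    absolute, relative = bias_reduction(before, after)
    print(f"{label}: {absolute:.2f} points absolute, {relative:.1f}% relative")
```

Reporting both quantities side by side avoids ambiguity: a drop from 41.64% to 2.4% is roughly 39 percentage points in absolute terms but a much larger share of the original bias rate in relative terms.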