Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting

Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki, Timothy Baldwin

TL;DR

This work investigates whether Chain-of-Thought prompting can mitigate gender bias in unscalable tasks by introducing the Multi-step Gender Bias Reasoning (MGBR) benchmark, where LLMs must label each word as feminine or masculine before counting. The study finds that CoT reduces observed bias across many models, especially larger ones, while simple Debiased Prompts are less effective. MGBR correlates with extrinsic bias benchmarks, shedding light on downstream bias rather than intrinsic biases, and larger models show stronger word-level alignment with human bias annotations. Limitations include language scope and binary gender focus, pointing to future work on non-binary genders and cross-lingual extensions.

Abstract

There exist both scalable tasks, like reading comprehension and fact-checking, where model performance improves with model size, and unscalable tasks, like arithmetic reasoning and symbolic reasoning, where model performance does not necessarily improve with model size. Large language models (LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate incremental predictions even on unscalable tasks. Unfortunately, despite their exceptional reasoning abilities, LLMs tend to internalize and reproduce discriminatory societal biases. Whether CoT can provide discriminatory or egalitarian rationalizations for the implicit information in unscalable tasks remains an open question. In this study, we examine the impact of LLMs' step-by-step predictions on gender bias in unscalable tasks. For this purpose, we construct a benchmark for an unscalable task where the LLM is given a list of words comprising feminine, masculine, and gendered occupational words, and is required to count the number of feminine and masculine words. In our CoT prompts, we require the LLM to explicitly indicate whether each word in the word list is feminine or masculine before making the final predictions. Because it involves both counting and handling the meaning of words, this benchmark has characteristics of both arithmetic reasoning and symbolic reasoning. Experimental results in English show that without step-by-step prediction, most LLMs make socially biased predictions, despite the task being as simple as counting words. Interestingly, CoT prompting reduces this unconscious social bias in LLMs and encourages fair predictions.
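
The counting task described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the word sets, the prompt wording, and the function names (`gold_counts`, `direct_prompt`, `cot_prompt`) are hypothetical and are not taken from the MGBR benchmark or the authors' code. It only shows how a gold count can be derived from word definitions and how a direct prompt might differ from one that asks for per-word labels first.

```python
# Hypothetical word sets for illustration only; not the benchmark's vocabulary.
FEMININE = {"woman", "mother", "actress"}
MASCULINE = {"man", "father", "actor"}
# Occupational words carry stereotypes but are not gendered by definition,
# so they must not contribute to either count.
OCCUPATIONS = {"nurse", "engineer"}


def gold_counts(words):
    """Return (feminine, masculine) counts based on word definitions alone."""
    feminine = sum(w in FEMININE for w in words)
    masculine = sum(w in MASCULINE for w in words)
    return feminine, masculine


def direct_prompt(words, target):
    """Ask for the count directly, with no intermediate steps (assumed wording)."""
    return f"How many of the following words are definitely {target}? {', '.join(words)}."


def cot_prompt(words, target):
    """Ask for a per-word label before the final count (assumed wording)."""
    return (
        direct_prompt(words, target)
        + " First state for each word whether it is feminine or masculine,"
        + " then give the count."
    )


def is_correct(predicted_count, words, target):
    """Compare a model's predicted count with the definition-based gold count."""
    feminine, masculine = gold_counts(words)
    gold = feminine if target == "feminine" else masculine
    return predicted_count == gold
```

In this sketch, a model that counts "nurse" as feminine would answer 3 instead of 2 for the list `["woman", "nurse", "father", "actress"]`, which is the kind of stereotype-driven miscount the benchmark is designed to expose.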

Paper Structure

This paper contains 13 sections, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: An example from the multi-step gender bias reasoning dataset.
  • Figure 2: The process of creating the MGBR benchmark.
  • Figure 3: Accuracy of the Few-shot, Few-shot+CoT, and Few-shot+Debiased for pro-stereotypical instances when using opt, llama2, and llama2-hf series LLMs, averaged over female and male instances.