
Social Bias Evaluation for Large Language Models Requires Prompt Variations

Rem Hida, Masahiro Kaneko, Naoaki Okazaki

TL;DR

This work demonstrates that social-bias evaluation for large language models is highly sensitive to how prompts are framed, including task instructions, few-shot examples, and debias-prompts. By systematically varying prompts across zero-shot, few-shot, and debias-prompt configurations on 12 models using the BBQ dataset, the authors show that model rankings and debiasing outcomes can fluctuate significantly with prompt format, revealing tradeoffs between task performance and bias. They also show that instance ambiguity contributes to sensitivity and that few-shot prompting can mitigate some but not all of these effects. The findings argue for adopting diverse prompt variations in bias evaluation to obtain robust, comparative assessments and to better understand the limits of debiasing methods in real-world use. This has practical impact for researchers and developers aiming to reliably audit and compare LLMs across tasks and domains.

Abstract

Warning: This paper contains examples of stereotypes and biases. Large Language Models (LLMs) exhibit considerable social biases, and various studies have tried to evaluate and mitigate these biases accurately. Previous studies use downstream tasks as prompts to examine the degree of social bias for evaluation and mitigation. Although LLM output depends heavily on the prompt, previous studies evaluating and mitigating bias have often relied on a limited variety of prompts. In this paper, we investigate the sensitivity of LLMs to prompt variations (task instruction and prompt, few-shot examples, debias-prompt) by analyzing the task performance and social bias of LLMs. Our experimental results reveal that LLMs are highly sensitive to prompts, to the extent that the ranking of LLMs fluctuates when comparing models on task performance and social bias. Additionally, we show that prompts induce tradeoffs between performance and social bias: a prompt setting that yields less bias may also reduce performance. Moreover, the ambiguity of instances is one of the reasons for this sensitivity to prompts in advanced LLMs, leading to varied outputs. We recommend using diverse prompts, as in this study, to compare the effects of prompts on social bias in LLMs.
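As a minimal sketch of how ranking fluctuation across prompt formats could be quantified, the snippet below compares the model orderings produced by two prompt formats using Kendall's tau. The model names, the scores, and the choice of Kendall's tau are illustrative assumptions, not the paper's data or its stated procedure.

```python
from itertools import combinations


def ranking(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same models.

    Ties are counted as neither concordant nor discordant.
    """
    models = list(scores_a)
    concordant = 0
    discordant = 0
    for m1, m2 in combinations(models, 2):
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for three placeholder models under two prompt formats.
# These numbers are invented for illustration only.
prompt_format_1 = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58}
prompt_format_2 = {"model_a": 0.60, "model_b": 0.66, "model_c": 0.55}

rank_1 = ranking(prompt_format_1)
rank_2 = ranking(prompt_format_2)
tau = kendall_tau(prompt_format_1, prompt_format_2)

print(rank_1)
print(rank_2)
print(round(tau, 3))
```

A tau of 1 would mean both prompt formats order the models identically; lower values indicate that the comparison between models depends on the prompt format used.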

Paper Structure (34 sections, 4 equations, 3 figures, 13 tables)


Figures (3)

  • Figure 1: Prompt Variations on Bias Evaluation: This example shows three kinds of prompt variation in bias evaluation with a downstream task: (1) task instruction and prompt, (2) few-shot examples, and (3) debias-prompt. Each of these factors can affect the scores. The instance was sampled from the BBQ dataset (Parrish et al., 2022).
  • Figure 2: Correlation between Metrics in Few-Shot Setting: $\text{Acc}_{\text{a}}$ and $\text{Acc}_{\text{d}}$ (left) have a negative correlation, which means that task performance trades off between ambiguous and disambiguated contexts. $\text{Acc}_{\text{a}}$ and $\text{Diff-bias}_{\text{a}}$ (center left) have little correlation. $\text{Acc}_{\text{d}}$ and $\text{Diff-bias}_{\text{d}}$ (center right) have a positive correlation; this is an undesirable trend, because it means that bias increases as performance increases in disambiguated contexts.
  • Figure 3: Sensitive Instance Number Histogram across Models: More instances are sensitive across more models, and this tendency is mitigated in the few-shot setting. Ambiguous-context instances are more sensitive across models.
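
The correlation readings in the Figure 2 caption can be illustrated with a small sketch that computes a Pearson correlation between two metrics measured across prompt variants. The values below are invented placeholders, and the use of Pearson's coefficient is an assumption; the caption does not say which correlation measure the authors used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies for one model under four prompt variants.
# Invented for illustration; not taken from the paper.
acc_ambiguous = [0.2, 0.4, 0.6, 0.8]
acc_disambiguated = [0.9, 0.7, 0.6, 0.4]

r = pearson(acc_ambiguous, acc_disambiguated)
print(round(r, 3))
```

A strongly negative coefficient, as in this toy data, corresponds to the tradeoff pattern described for the left panel: prompt variants that raise accuracy in ambiguous contexts lower it in disambiguated ones.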