Social Bias Evaluation for Large Language Models Requires Prompt Variations
Rem Hida, Masahiro Kaneko, Naoaki Okazaki
TL;DR
This work demonstrates that social-bias evaluation for large language models is highly sensitive to how prompts are framed, including task instructions, few-shot examples, and debias-prompts. By systematically varying prompts across zero-shot, few-shot, and debias-prompt configurations on 12 models using the BBQ dataset, the authors show that model rankings and debiasing outcomes can fluctuate significantly with prompt format, revealing tradeoffs between task performance and bias. They also show that instance ambiguity contributes to this sensitivity and that few-shot prompting can mitigate some but not all of these effects. The findings argue for adopting diverse prompt variations in bias evaluation to obtain robust comparative assessments and to better understand the limits of debiasing methods in real-world use. This matters for researchers and developers who need to audit and compare LLMs reliably.
Abstract
Warning: This paper contains examples of stereotypes and biases. Large Language Models (LLMs) exhibit considerable social biases, and various studies have tried to evaluate and mitigate these biases accurately. Previous studies use downstream tasks as prompts to examine the degree of social bias for evaluation and mitigation. Although LLM output depends heavily on prompts, previous studies evaluating and mitigating bias have often relied on a limited variety of prompts. In this paper, we investigate the sensitivity of LLMs to prompt variations (task instruction and prompt, few-shot examples, debias-prompt) by analyzing their task performance and social bias. Our experimental results reveal that LLMs are so sensitive to prompts that the ranking of LLMs fluctuates when models are compared on task performance and social bias. Additionally, we show that prompts induce tradeoffs between performance and social bias: a prompt setting that reduces bias may also reduce performance. Moreover, the ambiguity of instances is one reason for this sensitivity to prompts in advanced LLMs, leading to varied outputs. We recommend using diverse prompts, as in this study, to compare the effects of prompts on social bias in LLMs.
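To make the ranking-fluctuation claim concrete, the following is a minimal sketch of one way to check whether model rankings agree across prompt formats. It is an illustration under stated assumptions, not the paper's evaluation code: the model names, prompt names, and scores are hypothetical placeholders rather than reported results, and Kendall's tau is simply one reasonable agreement measure chosen here for the example.

```python
from itertools import combinations


def ranking(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda model: scores[model], reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same models.

    Counts model pairs ordered the same way (concordant) or the opposite
    way (discordant) under the two prompts. Ties count as neither.
    """
    models = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical placeholder scores (e.g. accuracy or a bias score) for three
# made-up models under three made-up prompt formats. Not real results.
scores_by_prompt = {
    "prompt_1": {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60},
    "prompt_2": {"model_a": 0.65, "model_b": 0.75, "model_c": 0.55},
    "prompt_3": {"model_a": 0.60, "model_b": 0.70, "model_c": 0.80},
}

for name, scores in scores_by_prompt.items():
    print(name, ranking(scores))

# Pairwise rank agreement between prompt formats: 1.0 means identical
# rankings, -1.0 means fully reversed rankings.
for p1, p2 in combinations(scores_by_prompt, 2):
    tau = kendall_tau(scores_by_prompt[p1], scores_by_prompt[p2])
    print(f"{p1} vs {p2}: tau = {tau:.2f}")
```

A tau near 1 for every pair of prompt formats would suggest that a model comparison is robust to prompt choice. Low or negative values indicate that conclusions drawn from a single prompt may not carry over to another.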
