Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings
Zhao Liu, Tian Xie, Xueru Zhang
TL;DR
This work introduces Open-BBQ, an open-ended bias benchmark that extends BBQ with fill-in-the-blank and short-answer prompts to better reflect real-world LLM usage. It establishes an automated evaluation framework that uses GPT-4o with zero-shot, few-shot, and chain-of-thought prompting to convert free-text responses into comparable bias labels, and it analyzes bias on ambiguous and disambiguated questions for GPT-3.5 and GPT-4o. To address the over-correction observed with self-debiasing, the authors propose Composite Prompting, a two-stage in-context learning approach that first recognizes whether a question is ambiguous and then applies debiasing only to ambiguous questions, preserving accuracy on disambiguated ones. Empirical results show that Composite Prompting reduces bias to near zero while maintaining high accuracy across most demographic dimensions and question types, suggesting a practical, scalable path toward fairer open-ended LLM interactions. Overall, Open-BBQ provides a scalable framework for evaluating and mitigating social bias in realistic, open-ended LLM settings, with implications for deployment in safety- and fairness-critical applications.
Abstract
Current social bias benchmarks for Large Language Models (LLMs) primarily rely on predefined question formats such as multiple-choice, which limits their ability to reflect the complexity and open-ended nature of real-world interactions. To close this gap, we extend the existing BBQ dataset (Parrish et al., 2022) to Open-BBQ, a comprehensive framework for evaluating the social bias of LLMs in open-ended settings, by adding two question categories: fill-in-the-blank and short-answer. Because Open-BBQ elicits open-ended responses such as sentences and paragraphs, we develop an evaluation process that detects bias in this content by labeling the sentences and paragraphs. We also find that existing debiasing methods, such as self-debiasing (Gallegos et al., 2024), suffer from over-correction, which turns originally correct answers into incorrect ones. To address this, we propose Composite Prompting, an In-context Learning (ICL) method that combines structured examples with explicit chain-of-thought reasoning into a unified instruction template, so that LLMs explicitly identify content that needs debiasing. Experimental results show that the proposed method significantly reduces bias for both GPT-3.5 and GPT-4o while maintaining high accuracy.
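To make the two-stage idea concrete, the following is a minimal Python sketch of the control flow: first ask the model whether the context is ambiguous, then apply a debiasing instruction only if it is. It is an illustration under stated assumptions, not the authors' implementation. The prompt wording, the `AMBIGUOUS`/`DISAMBIGUATED` verdict convention, and the names `is_ambiguous` and `composite_answer` are all hypothetical, and `llm` stands for any function that maps a prompt string to a response string.

```python
from typing import Callable

# Hypothetical templates: the actual Composite Prompting template (structured
# examples plus chain-of-thought) is not reproduced here.
AMBIGUITY_PROMPT = (
    "Context: {context}\nQuestion: {question}\n"
    "Does the context give enough information to answer the question? "
    "Think step by step, then end with one word: AMBIGUOUS or DISAMBIGUATED."
)
DEBIAS_PROMPT = (
    "Context: {context}\nQuestion: {question}\n"
    "The context does not determine the answer. Do not rely on any "
    "stereotype; if the answer cannot be determined, say so."
)
PLAIN_PROMPT = (
    "Context: {context}\nQuestion: {question}\n"
    "Answer using only the information in the context."
)


def is_ambiguous(verdict: str) -> bool:
    """Read the final word of the stage-one response as the verdict."""
    words = verdict.strip().upper().split()
    return bool(words) and words[-1].strip(".'\"") == "AMBIGUOUS"


def composite_answer(llm: Callable[[str], str], context: str, question: str) -> str:
    """Stage 1: classify ambiguity. Stage 2: debias only ambiguous questions."""
    verdict = llm(AMBIGUITY_PROMPT.format(context=context, question=question))
    template = DEBIAS_PROMPT if is_ambiguous(verdict) else PLAIN_PROMPT
    return llm(template.format(context=context, question=question))
```

Routing disambiguated questions to the plain prompt is what is meant to avoid the over-correction described above: the debiasing instruction never reaches a question whose context already determines the answer.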
