DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models

Suyoung Bae, YunSeok Choi, Jee-Hyong Lee

TL;DR

DeCAP addresses bias in zero-shot QA by introducing context-adaptive prompt generation that tailors debiasing actions to question context. It comprises Question Ambiguity Detection, which selects prefix instructions based on ambiguity, and Neutral Answer Guidance Generation, which retrieves neutral external demonstrations to guide unbiased judgments. Across BBQ and UNQOVER benchmarks and eight LLMs, DeCAP achieves state-of-the-art debiased QA performance, improving accuracy while reducing bias scores and mitigating context-dependent trade-offs. The method operates without retraining, leveraging retrieval-based neutral guidance and context-aware prompts to enhance fairness and accuracy in diverse QA settings.

Abstract

While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context or to prevent bias propagation in the answers. To address this, we propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation. DeCAP uses Question Ambiguity Detection to take appropriate debiasing actions based on the context, and Neutral Answer Guidance Generation to guide LLMs toward objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance. This demonstrates DeCAP's efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.

Paper Structure

This paper contains 65 sections, 4 equations, 3 figures, 16 tables.

Figures (3)

  • Figure 1: Overview of DeCAP: (A) Question Ambiguity Detection selects a prefix instruction tailored to each question type. (B) Neutral Answer Guidance Generation produces a neutral answer guidance that steers the LLM towards debiased answers by ensuring it considers the question fairly. The sentences generated by (A) and (B) are inserted at their respective positions in the context-adaptive prompt.
  • Figure 2: Effectiveness of reducing the performance gap between ambiguous and unambiguous questions: We compare the accuracy (left) and bias score (right) of Llama3 (8B) for each question type in BBQ.
  • Figure 3: Debiasing performance across bias categories on various LLMs: Each cell displays the bias score of the corresponding model for the respective bias category. The left heatmap shows the bias scores calculated from the Base, while the right presents the results from DeCAP (ours). The $y$-axis represents the bias categories present in each dataset, and the $x$-axis represents the 8 LLMs.
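
The prompt assembly described in Figure 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prefix wording, the `QAExample` type, the function names, and the order of the prompt slots are all hypothetical. The ambiguity decision and the neutral answer guidance are taken as inputs, because this summary does not specify how the detector or the guidance generator work internally.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical prefix instructions (placeholders, not the paper's wording).
PREFIX_AMBIGUOUS = (
    "The context may not contain enough information to answer the question. "
    "If so, choose the option that says the answer cannot be determined."
)
PREFIX_UNAMBIGUOUS = (
    "The context contains the information needed to answer the question. "
    "Answer using only that information."
)


@dataclass
class QAExample:
    """Hypothetical container for one multiple-choice QA item."""
    context: str
    question: str
    options: Sequence[str]


def select_prefix(is_ambiguous: bool) -> str:
    """(A) Pick the prefix instruction matching the detected question type."""
    return PREFIX_AMBIGUOUS if is_ambiguous else PREFIX_UNAMBIGUOUS


def build_prompt(example: QAExample, is_ambiguous: bool, neutral_guidance: str) -> str:
    """Assemble a context-adaptive prompt from the outputs of (A) and (B).

    The slot order (prefix, context + guidance, question, options) is an
    assumption made for illustration.
    """
    option_lines = [
        f"({chr(ord('a') + i)}) {opt}" for i, opt in enumerate(example.options)
    ]
    parts = [
        select_prefix(is_ambiguous),
        f"Context: {example.context} {neutral_guidance}",
        f"Question: {example.question}",
        *option_lines,
        "Answer:",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    example = QAExample(
        context="A grandfather and his grandson were setting up a new phone.",
        question="Who struggled with the phone?",
        options=["The grandfather", "The grandson", "Cannot be determined"],
    )
    # In the method as summarized, both arguments below would come from
    # components (A) and (B); here they are supplied by hand.
    print(build_prompt(
        example,
        is_ambiguous=True,
        neutral_guidance="Age alone does not determine who is comfortable with technology.",
    ))
```

Keeping the ambiguity flag and the guidance sentence as plain arguments keeps the sketch independent of any particular detector or retriever, so either component can be swapped without touching the prompt template.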