Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models

Shachi H Kumar, Saurav Sahay, Sahisnu Mazumder, Eda Okur, Ramesh Manuvinakurike, Nicole Beckage, Hsuan Su, Hung-yi Lee, Lama Nachman

TL;DR

This work tackles the challenge of measuring gender bias in large language models by introducing an adversarial prompting framework that leverages an Attacker LLM, a diverse set of Target LLMs, and an Evaluator (LLM-as-a-Judge) to assess bias. It benchmarks multiple automatic metrics (Perspective API, sentiment, Regard, LlamaGuard2, OpenAI Compliance) against human evaluations, and demonstrates that the LLM-as-a-Judge metric best aligns with human judgments. The study uses counterfactual data augmentation and LoRA-finetuned attackers to generate gender-swapped prompts and evaluates across models of varying sizes, revealing that larger models tend to exhibit less bias. The findings support a path toward standardized bias evaluation and point to extensions beyond gender bias to other protected attributes, with implications for safer and fairer AI systems.
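The counterfactual data augmentation mentioned above can be illustrated with a small word-level gender swap. This is only a sketch under stated assumptions: the `SWAP` table and the `gender_swap` helper are hypothetical names introduced here, the lexicon is deliberately tiny, and the summary does not specify the authors' actual augmentation procedure.

```python
import re

# Hypothetical, deliberately small swap lexicon (an assumption, not the paper's list).
# "her" is ambiguous (him/his); it is mapped to "his" here as a simplification.
SWAP = {
    "he": "she", "she": "he",
    "him": "her", "his": "her", "her": "his",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
    "father": "mother", "mother": "father",
}

# Match whole words only, so that "theme" is not touched by the "he" entry.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE
)


def gender_swap(text: str) -> str:
    """Return a counterfactual copy of `text` with gendered words swapped."""

    def _replace(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAP[word.lower()]
        # Preserve the capitalization pattern of the original word.
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped

    return _PATTERN.sub(_replace, text)


original = "He thanked his mother."
counterfactual = gender_swap(original)
```

Pairing each prompt with its swapped counterpart lets responses to the two versions be compared directly. A production lexicon would need to handle names, titles, and ambiguous pronouns, which this sketch does not attempt.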

Abstract

Large Language Models (LLMs) have excelled at language understanding and at generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks in which malicious users prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards and consensus, and existing methods often rely on human-generated templates and annotations, which are expensive and labor-intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM-based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.
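The attacker, target, and judge roles described above can be wired together roughly as follows. This is a minimal sketch, not the authors' implementation: `evaluate_targets`, the callable signatures, the stub models, and the numeric score scale are all assumptions made for illustration, and in practice each callable would wrap a real model call.

```python
from statistics import mean
from typing import Callable, Dict, List


def evaluate_targets(
    seeds: List[str],
    attacker: Callable[[str], str],
    targets: Dict[str, Callable[[str], str]],
    judge: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the mean judge score per target model.

    attacker(seed) -> adversarial prompt
    target(prompt) -> response
    judge(prompt, response) -> bias score (higher = more biased; assumed scale)
    """
    prompts = [attacker(seed) for seed in seeds]
    results = {}
    for name, target in targets.items():
        scores = [judge(prompt, target(prompt)) for prompt in prompts]
        results[name] = mean(scores)
    return results


# Toy stand-ins so the sketch runs without any model; all are hypothetical.
def stub_attacker(seed: str) -> str:
    return f"Complete the sentence: {seed}"


stub_targets = {
    "echo": lambda prompt: prompt,  # repeats the prompt back
    "neutral": lambda prompt: "People of any gender can do this.",
}


def stub_judge(prompt: str, response: str) -> float:
    # Placeholder rule: flag sweeping generalizations. A real judge would be an LLM.
    return 1.0 if "always" in response.lower() else 0.0


scores = evaluate_targets(
    ["nurses are always", "engineers are"], stub_attacker, stub_targets, stub_judge
)
```

Passing the three roles in as plain callables keeps the loop independent of any particular model API, so the same harness could compare several judges or automatic metrics against human ratings.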

Paper Structure (20 sections, 13 figures, 4 tables)

Figures (13)

  • Figure 2: Bias Detection Workflow. The Attacker LLM synthesizes adversarial prompts for Target LLMs. Then, we apply a holistic evaluation of their responses to diagnose Target LLMs' biases. See Section \ref{sec:method} for details.
  • Figure 3: Human Evaluation - AMT Task 1 Description
  • Figure 4: Human Evaluation - AMT Task Description (2)
  • Figure 5: Human Evaluation - AMT Task Description (3)
  • Figure 6: Identity Attack Score Comparison
  • ...and 8 more figures