Table of Contents
Fetching ...

Certifying Counterfactual Bias in LLMs

Isha Chaudhary, Qian Hu, Manoj Kumar, Morteza Ziyadi, Rahul Gupta, Gagandeep Singh

TL;DR

The paper addresses the lack of scalable, guaranteed evaluation of counterfactual bias in LLMs by introducing LLMCert-B, a black box certification framework that yields high confidence bounds on the probability of unbiased responses over distributions of counterfactual prompts. It formalizes bias as a relational property across demographic groups and employs probabilistic specifications defined via prompt distributions formed by prefixes, including random, jailbreak based, and embedding space perturbations. Certification relies on sampling from these distributions and using Clopper-Pearson confidence intervals to bound the target probability $\mathcal{C}(\Delta, \mathcal{D}, \mathcal{L})$, with $\mathcal{F}$ as a Bernoulli indicator of bias. Experiments on BOLD and Decoding Trust across multiple models demonstrate that simple prefix distributions can uncover previously unknown vulnerabilities, while random prefixes are less effective, highlighting the importance of adversarial prefix design and alignment quality over model size. The framework is open source and provides practical guidance for developers to assess and compare bias risk in LLM deployments.

Abstract

Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large number of inputs and do not provide guarantees. Therefore, we propose the first framework, LLMCert-B that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts - prompts differing by demographic groups, sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in LLM's embedding space. We generate non-trivial certificates for SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.

Certifying Counterfactual Bias in LLMs

TL;DR

The paper addresses the lack of scalable, guaranteed evaluation of counterfactual bias in LLMs by introducing LLMCert-B, a black box certification framework that yields high confidence bounds on the probability of unbiased responses over distributions of counterfactual prompts. It formalizes bias as a relational property across demographic groups and employs probabilistic specifications defined via prompt distributions formed by prefixes, including random, jailbreak based, and embedding space perturbations. Certification relies on sampling from these distributions and using Clopper-Pearson confidence intervals to bound the target probability , with as a Bernoulli indicator of bias. Experiments on BOLD and Decoding Trust across multiple models demonstrate that simple prefix distributions can uncover previously unknown vulnerabilities, while random prefixes are less effective, highlighting the importance of adversarial prefix design and alignment quality over model size. The framework is open source and provides practical guidance for developers to assess and compare bias risk in LLM deployments.

Abstract

Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large number of inputs and do not provide guarantees. Therefore, we propose the first framework, LLMCert-B that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts - prompts differing by demographic groups, sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in LLM's embedding space. We generate non-trivial certificates for SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.
Paper Structure (33 sections, 1 equation, 17 figures, 4 tables, 4 algorithms)

This paper contains 33 sections, 1 equation, 17 figures, 4 tables, 4 algorithms.

Figures (17)

  • Figure 1: (Overview of LLMCert-B): LLMCert-B is a quantitative certification framework to certify bias in target LLM's responses to a random set of prompts differing only by a sensitive attribute. In specific instantiations, LLMCert-B samples a (a) set of prefixes from a given distribution and prepends them to a prompt set to form (b) the prompts given to the target LLM. (c) The target LLM's responses are checked for bias by a bias detector, (d) whose results are fed into a certifier. (e) Certifier computes bounds using the Clopper-Pearson method clopper-pearson on probability of unbiased LLM responses for any set of prompts formed with a random prefix from prefix distribution.
  • Figure 2: Example from $\mathcal{Q}_{BOLD}$
  • Figure 3: Example from $\mathcal{Q}_{DT}$
  • Figure 4: Responses (and bounds) of Mistral, Llama-7B, Gemini, and GPT-4 when certifying with BOLD example 'Desktop automation refers to software programming' for mixture of jailbreaks.
  • Figure 5: Main Jailbreaks for different sensitive attributes
  • ...and 12 more figures

Theorems & Definitions (1)

  • Definition 1