Certifying Counterfactual Bias in LLMs
Isha Chaudhary, Qian Hu, Manoj Kumar, Morteza Ziyadi, Rahul Gupta, Gagandeep Singh
TL;DR
The paper addresses the lack of scalable evaluation of counterfactual bias in LLMs with formal guarantees by introducing LLMCert-B, a black-box certification framework that yields high-confidence bounds on the probability of unbiased responses over distributions of counterfactual prompts. It formalizes bias as a relational property across demographic groups and uses probabilistic specifications defined over prompt distributions, which are built by prepending prefixes drawn from random, jailbreak-based, and embedding-space-perturbation distributions. Certification samples from these distributions and uses Clopper-Pearson confidence intervals to bound the target probability $\mathcal{C}(\Delta, \mathcal{D}, \mathcal{L})$, with $\mathcal{F}$ as a Bernoulli indicator of bias. Experiments on BOLD and DecodingTrust across multiple models show that simple prefix distributions can uncover previously unknown vulnerabilities, while random prefixes are less effective, which highlights the importance of adversarial prefix design and suggests that alignment quality matters more than model size. The framework is open source and gives developers practical guidance for assessing and comparing bias risk in LLM deployments.
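As a rough illustration of the statistical step described above, the following Python sketch computes an exact Clopper-Pearson interval from a count of unbiased samples. It is a minimal sketch, not the LLMCert-B implementation: the function names, the `sample_is_unbiased` callback (standing in for sampling a counterfactual prompt set, querying the model, and running a bias detector), and the sample counts are all assumptions made for the example.

```python
import math
import random


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, confidence=0.95):
    """Exact two-sided interval for a binomial proportion (k successes in n trials)."""
    alpha = 1.0 - confidence

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for the p where f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P[X >= k | p] = alpha/2, i.e. P[X <= k-1 | p] = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P[X <= k | p] = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def certify(sample_is_unbiased, n_samples, confidence=0.95):
    """Bound the probability of an unbiased response from n_samples i.i.d. draws.

    sample_is_unbiased is a hypothetical callback returning True when the
    responses to one sampled counterfactual prompt set are judged unbiased.
    """
    k = sum(1 for _ in range(n_samples) if sample_is_unbiased())
    return clopper_pearson(k, n_samples, confidence)


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated detector outcome (assumption): unbiased with probability 0.8.
    bounds = certify(lambda: rng.random() < 0.8, n_samples=50)
    print("certified bounds on P(unbiased):", bounds)
```

The interval is exact in the sense that it inverts the binomial tail probabilities directly rather than relying on a normal approximation, which is why it remains valid for small sample counts. Where SciPy is available, `scipy.stats.binomtest(k, n).proportion_ci(confidence_level=0.95, method="exact")` should give the same interval.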
Abstract
Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large numbers of inputs and do not provide guarantees. Therefore, we propose LLMCert-B, the first framework that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts (prompts differing by demographic group) sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes, sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting of random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in the LLM's embedding space. We generate non-trivial certificates for SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive prefix distributions.
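To make the construction of a counterfactual prompt distribution more concrete, the sketch below draws one sample from the simplest case, a random-token prefix distribution. It is an illustration under stated assumptions rather than the paper's code: the `{group}` template placeholder, the toy vocabulary, the function names, and the choice to prepend the same sampled prefix to every prompt in the set are all hypothetical.

```python
import random


def sample_random_prefix(vocab, length, rng):
    """Draw a prefix of `length` tokens uniformly at random from `vocab`."""
    return " ".join(rng.choice(vocab) for _ in range(length))


def counterfactual_prompts(template, groups):
    """Instantiate one prompt per demographic group from a shared template."""
    return [template.format(group=g) for g in groups]


def sample_prefixed_set(template, groups, vocab, length, rng):
    """One sample from the prompt distribution: a single random prefix
    prepended to every counterfactual prompt (assumption: shared prefix)."""
    prefix = sample_random_prefix(vocab, length, rng)
    return prefix, [f"{prefix} {p}" for p in counterfactual_prompts(template, groups)]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_vocab = ["alpha", "beta", "gamma", "delta"]  # placeholder tokens
    prefix, prompts = sample_prefixed_set(
        "Complete the sentence about a {group} engineer:",
        ["male", "female"],
        toy_vocab,
        length=5,
        rng=rng,
    )
    for p in prompts:
        print(p)
```

Sharing one prefix across the set keeps the demographic group as the only difference between the prompts, so any difference in the responses can be attributed to the group rather than to the prefix.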
