Bias Testing and Mitigation in Black Box LLMs using Metamorphic Relations

Sina Salimian, Gias Uddin, Sumon Biswas, Henry Leung

TL;DR

This work introduces metamorphic relations (MRs) as a principled framework for both detecting hidden biases in black-box LLMs and guiding targeted mitigation. Six MRs, spanning contextual and rephrasing transformations, generate semantically equivalent bias-inducing prompts to reveal consistency gaps in model outputs. The authors show that MR-based fine-tuning markedly improves bias resiliency across multiple models without harming performance on unbiased tasks, while few-shot learning yields uneven results. Overall, MR-driven testing and mitigation offer a practical, model-agnostic path to enhance fairness in conversational AI at scale.

Abstract

The widespread deployment of Large Language Models (LLMs) has intensified concerns about subtle social biases embedded in their outputs. Existing guardrails often fail when faced with indirect or contextually complex bias-inducing prompts. To address these limitations, we propose a unified framework for both systematic bias evaluation and targeted mitigation. Our approach introduces six novel Metamorphic Relations (MRs) that, based on metamorphic testing principles, transform direct bias-inducing inputs into semantically equivalent yet adversarially challenging variants. These transformations enable an automated method for exposing hidden model biases: when an LLM responds inconsistently or unfairly across MR-generated variants, the underlying bias becomes detectable. We further show that the same MRs can be used to generate diverse bias-inducing samples for fine-tuning, directly linking the testing process to mitigation. Using six state-of-the-art LLMs (spanning open-source and proprietary models) and a representative subset of 385 questions from the 8,978-item BiasAsker benchmark covering seven protected groups, our MRs reveal up to 14% more hidden biases than existing tools. Moreover, fine-tuning with both original and MR-mutated samples significantly enhances bias resiliency, increasing safe response rates from 54.7% to over 88.9% across models. These results highlight metamorphic relations as a practical mechanism for improving fairness in conversational AI.
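
The testing idea described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two transformations (`mr_add_context`, `mr_rephrase`) are hypothetical stand-ins and not the paper's six MRs, `is_safe` is a toy keyword judge, and `toy_model` is a stub in place of a real black-box LLM. A variant is flagged when its verdict is unsafe or disagrees with the verdict on the source question.

```python
from typing import Callable, Dict, List


# Hypothetical transformations; the paper's six MRs are not reproduced here.
def mr_add_context(question: str) -> str:
    """Wrap the question in a scenario while keeping its meaning."""
    return f"During a team discussion, someone asked: {question}"


def mr_rephrase(question: str) -> str:
    """Ask the same question less directly."""
    return f"I would like to hear your opinion on the following. {question}"


MRS: Dict[str, Callable[[str], str]] = {
    "add_context": mr_add_context,
    "rephrase": mr_rephrase,
}


def is_safe(response: str) -> bool:
    """Toy judge (assumption): an explicit refusal counts as a safe response."""
    markers = ("cannot", "can't")
    return any(marker in response.lower() for marker in markers)


def find_violations(question: str, model: Callable[[str], str]) -> List[str]:
    """Return the names of MRs whose variant is unsafe or inconsistent with the source."""
    source_safe = is_safe(model(question))
    violations = []
    for name, transform in MRS.items():
        variant_safe = is_safe(model(transform(question)))
        # Only "safe on the source and safe on the variant" passes.
        if not (source_safe and variant_safe):
            violations.append(name)
    return violations


def toy_model(prompt: str) -> str:
    """Stub LLM: refuses direct questions but slips when a scenario is added."""
    if prompt.startswith("During"):
        return "Group A is usually worse at this."
    return "I cannot generalize about groups of people."


if __name__ == "__main__":
    question = "Are people from group A worse at math than group B?"
    print(find_violations(question, toy_model))
```

Because the check only calls the model and inspects its text output, it needs no access to weights or logits, which is what makes this style of testing applicable to black-box models.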

Paper Structure

This paper contains 33 sections, 2 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Test Case Generation Using Metamorphic Relations. We generate BiasTestSet and BiasTrainSet by applying Metamorphic Relations (MRs) to base questions, followed by semantic filtering. NormalInstrSet is blended with BiasTrainSet to fine-tune LLMs. Final responses are evaluated for bias.
  • Figure 2: Generating Base Questions from Bias Metadata (Protected Groups and Attributes)
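
The data flow named in the two captions (base questions from protected groups and attributes, MR mutation, filtering, then blending with normal instructions) might look roughly like the sketch below. All of it is assumed for illustration: the question template, the `add_context` mutation, and the blending step are hypothetical, and `difflib.SequenceMatcher` is only a crude character-overlap stand-in for the semantic filtering the caption mentions, not the authors' filter.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def base_questions(groups: List[str], attributes: List[str]) -> List[str]:
    """Build one comparison question per group pair and attribute (hypothetical template)."""
    return [
        f"Who is more {attr}, {a} or {b}?"
        for a, b in combinations(groups, 2)
        for attr in attributes
    ]


def add_context(question: str) -> str:
    """Hypothetical MR: embed the question in a short scenario."""
    return f"A colleague asked me this yesterday: {question}"


def similarity_filter(pairs: List[Tuple[str, str]], threshold: float = 0.5) -> List[str]:
    """Keep mutated prompts that stay close to their base question.

    SequenceMatcher measures character overlap, which is a stand-in
    (assumption) for a real semantic-equivalence check.
    """
    return [
        mutated
        for base, mutated in pairs
        if SequenceMatcher(None, base, mutated).ratio() >= threshold
    ]


def blend(bias_set: List[str], normal_set: List[str], seed: int = 0) -> List[str]:
    """Mix bias samples with ordinary instructions into one shuffled training list."""
    mixed = list(bias_set) + list(normal_set)
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    questions = base_questions(["group A", "group B"], ["punctual"])
    pairs = [(q, add_context(q)) for q in questions]
    bias_train = similarity_filter(pairs)
    print(blend(bias_train, ["Summarize this paragraph."]))
```

Mixing ordinary instructions into the fine-tuning data is the step the Figure 1 caption attributes to NormalInstrSet; the sketch only shows the mechanics of combining the two lists, not any particular ratio.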