Toward Systematic Counterfactual Fairness Evaluation of Large Language Models: The CAFFE Framework

Alessandra Parziale, Gianmario Voria, Valeria Pontillo, Gemma Catolino, Andrea De Lucia, Fabio Palomba

TL;DR

The paper introduces CAFFE, a structured, intent-aware framework for systematically evaluating counterfactual fairness in Large Language Models. CAFFE formalizes test cases using templates aligned with ISO/IEC/IEEE 29119, automatically generates linguistically diverse counterfactual prompts from a stereotype knowledge base (CrowS-Pairs), and evaluates model responses with semantic similarity metrics. Empirical results across GPT-4o, LLaMA-2, and Mistral show that CAFFE achieves broader bias coverage and up to a 60% improvement in fairness bug detection over metamorphic testing, supported by a methodology for metric selection and thresholding. A replication package supports reuse and reproducibility, and the work points to future extensions, including domain-specific knowledge bases and multimodal fairness testing.

Abstract

Large Language Models (LLMs) are foundational components of modern software systems. As their influence grows, concerns about fairness have become increasingly pressing. Prior work has proposed metamorphic testing to detect fairness issues, applying input transformations to uncover inconsistencies in model behavior. This paper introduces an alternative perspective for testing counterfactual fairness in LLMs, proposing a structured and intent-aware framework called CAFFE (Counterfactual Assessment Framework for Fairness Evaluation). Inspired by traditional non-functional testing, CAFFE (1) formalizes LLM-Fairness test cases through explicitly defined components, including prompt intent, conversational context, input variants, expected fairness thresholds, and test environment configuration, (2) assists testers by automatically generating targeted test data, and (3) evaluates model responses using semantic similarity metrics. Our experiments, conducted on three different architectural families of LLMs, demonstrate that CAFFE achieves broader bias coverage and more reliable detection of unfair behavior than existing metamorphic approaches.
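
To make the test-case components listed in the abstract concrete, the following is a minimal sketch of how such a test case might be represented and checked. It is not the authors' implementation: the `FairnessTestCase` class, its field names, and the `evaluate` function are hypothetical, and a simple bag-of-words cosine similarity stands in for whichever semantic similarity metric CAFFE actually uses.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations
from math import sqrt


@dataclass
class FairnessTestCase:
    """Hypothetical container mirroring the components named in the abstract."""
    intent: str                    # what the prompt is meant to elicit
    context: str                   # conversational context given to the model
    input_variants: list[str]      # counterfactual prompts (e.g. swapped group terms)
    threshold: float               # minimum similarity expected between responses
    environment: dict[str, str] = field(default_factory=dict)  # model, temperature, ...


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; an assumed stand-in for a semantic metric."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def evaluate(case: FairnessTestCase, responses: list[str]) -> bool:
    """Pass only if every pair of responses is at least `threshold` similar."""
    scores = [cosine_similarity(x, y) for x, y in combinations(responses, 2)]
    return min(scores, default=1.0) >= case.threshold
```

In this sketch a test passes only when the responses to all counterfactual variants stay above the similarity threshold; a response that diverges for one variant is flagged as a potential fairness issue.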

Paper Structure

This paper contains 18 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Overview of the CAFFE framework.
  • Figure 2: Explanatory example of test case construction and execution by CAFFE.
  • Figure 3: Number of prompts required to reach the entropy plateau for each bias category.
  • Figure 4: Comparison of ASR results for CAFFE and METAL.