Towards Effective Discrimination Testing for Generative AI

Thomas P. Zollo, Nikita Rajaneesh, Richard Zemel, Talia B. Gillis, Emily Black

TL;DR

GenAI fairness research currently lacks the specificity and context-sensitivity required by anti-discrimination regulation, risking deployment of systems that appear fair but cause discriminatory outcomes in practice. The paper synthesizes legal and technical perspectives and presents four case studies showing misalignment between common GenAI fairness tests and regulatory objectives, including downstream harms, red-teaming variability, complex interaction modes, and user-driven modifications. It argues for context-specific, robust testing frameworks that mirror deployment conditions and consider downstream allocation effects, multi-turn interactions, and parameter changes. The work provides practical mitigation directions—such as domain-tailored evaluation suites, multi-method red-teaming, and parameter-risk monitoring—to improve the reliability of fairness assessments in real-world GenAI deployments and to better inform policy. Overall, it highlights the need for technically grounded, regulatable testing protocols that reduce discrimination hazards in GenAI systems while supporting accountability and liability considerations.

Abstract

Generative AI (GenAI) models present new challenges in regulating against discriminatory behavior. In this paper, we argue that GenAI fairness research still has not met these challenges; instead, a significant gap remains between existing bias assessment methods and regulatory goals. This leads to ineffective regulation that can allow deployment of reportedly fair, yet actually discriminatory, GenAI systems. Towards remedying this problem, we connect the legal and technical literature around GenAI bias evaluation and identify areas of misalignment. Through four case studies, we demonstrate how this misalignment between fairness testing techniques and regulatory goals can result in discriminatory outcomes in real-world deployments, especially in adaptive or complex environments. We offer practical recommendations for improving discrimination testing to better align with regulatory goals and enhance the reliability of fairness assessments in future deployments.

Paper Structure

This paper contains 37 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: The output of classification models can often be directly mapped onto allocative decisions, and thus traditional discrimination law can be applied directly. GenAI models bring unique challenges to applying both existing and emerging regulation. Most notably: 1) outputs are difficult to evaluate, and do not clearly map onto decisions; 2) complex interaction modes, such as multi-turn dialogue, cannot be easily recreated in test settings; 3) testing procedures (e.g., a particular red teaming approach) are sensitive to small changes in conditions and give highly variable results; 4) users may modify models after deployment, for example by changing sampling parameters.
  • Figure 2: Left: Summary quality is scored using ROUGE and compared across models and racial groups. Llama-2-7B produces the highest average score, and all models offer similar performance across groups, suggesting that Llama-2-7B might be chosen for deployment. Right: Although all resumes are the same, simulated outcomes produce different selection rates across groups. Llama-2-7B produces a $\sim$5% maximum gap across racial groups, while for Gemma-2 the difference is less than 2%.
  • Figure 3: Differences in alternative fairness metrics across groups, plotted against selection disparities. The models that are more discriminatory by selection rate (Llama-2 and Qwen) perform poorly according to these metrics; the less discriminatory models (Mistral and Gemma-2) perform relatively well. Such a holistic evaluation might have identified Gemma-2 as a less discriminatory alternative for deployment than Llama-2.
  • Figure 4: Red teaming results for bias against women, where higher scores indicate more toxic output. For each column, green is the most fair and red is the least fair. Variation across rows shows how the perceived fairness of candidate models is determined by a red team's testing decisions. If Mistral-7B is chosen as RedLM, the least fair model (Llama3-8B) may seem to be most fair.
  • Figure 5: Models undergo red teaming in the single- and multi-turn settings, with data from different domains and attacks from different LLMs. Gemma-2-9B (green) seems less discriminatory in the single-turn setting, but in fact exhibits worse behavior than Gemma-2-2B (red) in the context of a conversation.
  • ...and 3 more figures
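
The Figure 2 caption compares models by the maximum gap in selection rates across groups. As a minimal sketch of how such a gap could be computed, the Python below assumes outcomes arrive as (group, selected) pairs. The function names, group labels, and toy data are hypothetical illustrations and are not taken from the paper or its experiments.

```python
from collections import defaultdict


def selection_rates(outcomes):
    """Return {group: fraction selected} from (group, selected) pairs."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in outcomes:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {group: selected[group] / totals[group] for group in totals}


def max_selection_gap(rates):
    """Largest difference in selection rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data (NOT results from the paper):
# group_a is selected 6 times out of 10, group_b 5 times out of 10.
toy_outcomes = (
    [("group_a", True)] * 6
    + [("group_a", False)] * 4
    + [("group_b", True)] * 5
    + [("group_b", False)] * 5
)

rates = selection_rates(toy_outcomes)
gap = max_selection_gap(rates)
print(rates)
print(f"max selection-rate gap: {gap:.2%}")
```

Measuring the gap on simulated downstream decisions, not on summary-quality scores, mirrors the contrast the caption draws between the left and right panels: similar ROUGE scores across groups do not rule out different selection rates.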