Bias Testing and Mitigation in LLM-based Code Generation

Dong Huang, Jie M. Zhang, Qingwen Bu, Xiaofei Xie, Junjie Chen, Heming Cui

TL;DR

This work introduces a dedicated bias testing framework for LLM-based code generation and defines the Code Bias Score (CBS), along with two robustness variants (CBS_U@K, CBS_I@K), to quantify how prevalent bias is and how consistently it appears across runs. It applies the framework to five state-of-the-art LLMs on 334 bias-sensitive coding tasks (adult income, employability, health insurance), revealing pervasive biases, especially for age, region, and gender. The methodology combines automated AST-based test-case generation with human review, and it evaluates five bias-mitigation prompts and a feedback-driven mitigation approach. Prompt engineering alone yields limited bias reductions, while feedback from automated bias analysis dramatically lowers CBS (e.g., GPT-4 CBS from 59.88% to 4.79%). The results underscore the value of test-execution feedback in mitigating bias and highlight practical considerations such as test coverage, prompt design, and token usage for scalable bias detection in code-generation systems.
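To make the three scores concrete, here is a minimal Python sketch. It assumes that CBS is the share of generated functions flagged as biased in one run, that CBS_U@K counts a task when it is biased in at least one of K runs (union), and that CBS_I@K counts a task only when it is biased in all K runs (intersection). These readings are inferred from the metric names and the summary above, and the function names and sample data are hypothetical, not taken from the paper.

```python
from typing import Sequence


def cbs(biased: Sequence[bool]) -> float:
    """Share of tasks whose generated code was flagged as biased in one run."""
    return sum(biased) / len(biased)


def cbs_u_at_k(runs: Sequence[Sequence[bool]]) -> float:
    """Union variant (assumed): a task counts if ANY of the K runs is biased."""
    per_task = [any(flags) for flags in zip(*runs)]
    return sum(per_task) / len(per_task)


def cbs_i_at_k(runs: Sequence[Sequence[bool]]) -> float:
    """Intersection variant (assumed): a task counts only if ALL K runs are biased."""
    per_task = [all(flags) for flags in zip(*runs)]
    return sum(per_task) / len(per_task)


# Hypothetical flags: K = 2 runs over the same 4 tasks (True = biased).
runs = [
    [True, False, True, False],
    [True, True, False, False],
]

single_run = cbs(runs[0])
union_score = cbs_u_at_k(runs)
intersection_score = cbs_i_at_k(runs)
print(single_run, union_score, intersection_score)
```

Under these assumptions the union score is always at least as large as any single-run CBS, and the intersection score is never larger, so the gap between the two indicates how stable the biased behavior is across runs.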

Abstract

As the adoption of LLMs becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social bias and unfairness, such as those related to age, gender, and race? This issue concerns the integrity, fairness, and ethical foundation of software applications that depend on the code generated by these models, but it is underexplored in the literature. This paper presents a novel bias testing framework that is specifically designed for code generation tasks. Based on this framework, we conduct an extensive empirical study on the biases in code generated by five widely studied LLMs (i.e., PALM-2-CodeChat-bison, Claude-instant-1, GPT-3.5-turbo, GPT-4-turbo, and GPT-4). Our findings reveal that biases are prevalent. For example, 13.47% to 49.10% of the code generated by these LLMs exhibits biased behavior with respect to gender. Moreover, we study five bias mitigation prompt strategies that are commonly used in current code generation scenarios, i.e., zero-shot, one-shot, few-shot, and two Chain-of-Thought (CoT) prompts, each with and without feedback-driven refinement. Our evaluation results show that direct prompt engineering strategies have limited effectiveness in mitigating bias, but our test execution feedback can reduce the ratio of biased code to a large extent (e.g., from 59.88% to 4.79% for GPT-4).
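The idea of testing generated code for bias and feeding the result back to the model can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a function is flagged when changing only one protected attribute changes its output, and the generated function, attribute names, inputs, and feedback wording are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def find_bias(
    func: Callable[..., object],
    base_inputs: List[dict],
    protected: Dict[str, Sequence],
) -> Dict[str, List[dict]]:
    """For each protected attribute, collect the base inputs for which
    varying ONLY that attribute changes the function's output."""
    findings: Dict[str, List[dict]] = {}
    for attr, values in protected.items():
        for base in base_inputs:
            # Hold every other input fixed and vary a single attribute.
            outputs = {func(**{**base, attr: value}) for value in values}
            if len(outputs) > 1:
                findings.setdefault(attr, []).append(base)
    return findings


def feedback_message(findings: Dict[str, List[dict]]) -> str:
    """Build a hypothetical refinement prompt from the test results."""
    if not findings:
        return "No biased behavior was detected by the test cases."
    attrs = ", ".join(sorted(findings))
    return (
        "The output changes when only these attributes are varied: "
        f"{attrs}. Please revise the code so they do not affect the result."
    )


# Hypothetical stand-in for model-generated code with a gender bias.
def assess_employability(age: int, gender: str, education_years: int) -> bool:
    if gender == "male" and education_years >= 12:
        return True
    return education_years >= 16


base_inputs = [
    {"age": 30, "gender": "female", "education_years": 12},
    {"age": 30, "gender": "female", "education_years": 18},
]
protected = {"gender": ["male", "female"], "age": [25, 60]}

findings = find_bias(assess_employability, base_inputs, protected)
message = feedback_message(findings)
print(message)
```

In this toy case only `gender` is reported, because `age` never influences the result. A real harness would also need to choose input values that exercise every branch, which is where test coverage becomes a practical concern.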

Paper Structure

This paper contains 59 sections, 6 equations, 4 figures, and 16 tables.

Figures (4)

  • Figure 1: Prompt examples used by the previous method (liu2023uncovering) and by ours. The previous method directly uses an incomplete function definition with biased inputs, whereas we employ natural-language prompts.
  • Figure 2: An illustration of how bias manifests in LLMs that respond in natural language and in code generation models that respond with code functions. We use the code generation attributes of the adult income task to guide the LLM to generate code (see Figure 3).
  • Figure 3: Our code bias evaluation pipeline.
  • Figure 4: Example of automated test-case analysis feedback for the generated code shown in Figure 3.