FairCoder: Evaluating Social Bias of LLMs in Code Generation

Yongkang Du, Jen-tse Huang, Jieyu Zhao, Lu Lin

TL;DR

FairCoder introduces a dedicated benchmark to evaluate social bias in LLM-generated code across two real-world software engineering tasks: function implementation and unit-test generation. It defines three core fairness metrics and uses a bias-detection scorer plus a code-utility measure to assess outputs from 11 diverse LLMs. Experimental results show that all studied models exhibit social biases, with stronger biases in test-case generation and notable trade-offs between fairness and utility. The work highlights the need for broader attribute coverage and robust bias-mitigation techniques to promote fair and inclusive code-generation systems with real-world impact.

Abstract

Large language models (LLMs) have been widely deployed in coding tasks, drawing increasing attention to the evaluation of the quality and safety of LLMs' outputs. However, research on bias in code generation remains limited. Existing studies typically identify bias by applying malicious prompts or reusing tasks and datasets originally designed for discriminative models. Given that prior datasets are not fully optimized for code-related tasks, there is a pressing need for benchmarks specifically designed for evaluating code models. In this study, we introduce FairCoder, a novel benchmark for evaluating social bias in code generation. FairCoder explores the bias issue by following the software development pipeline, from function implementation to unit testing, across diverse real-world scenarios. Additionally, three metrics are designed to assess fairness performance on this benchmark. We conduct experiments on widely used LLMs and provide a comprehensive analysis of the results. The findings reveal that all tested LLMs exhibit social bias.

Paper Structure (37 sections, 3 equations, 10 figures, 18 tables, 2 algorithms)

Figures (10)

  • Figure 1: Demonstration of the proposed benchmark. The x-axis represents the pipeline of our framework, while the y-axis represents the pipeline of software development. For function implementation and unit testing, the input to the LLMs consists of an unbiased function demo and a request that contains sensitive attributes. After the code is generated, the metrics are calculated based on the LLMs' output.
  • Figure 2: Code templates for function implementation.
  • Figure 3: Model preference on function implementation. The x-axis represents the attributes examined across the three scenarios, while the y-axis denotes the LLMs. The color of each dot indicates the group favored by the model, with larger dots signifying stronger preferences. A detailed version is provided in Figure \ref{fig:result_function_detail} in the Appendix.
  • Figure 4: Refusal rate in function implementation and test case generation.
  • Figure 5: Job hiring.
  • ...and 5 more figures