FairCoder: Evaluating Social Bias of LLMs in Code Generation

Yongkang Du; Jen-tse Huang; Jieyu Zhao; Lu Lin

TL;DR

FairCoder introduces a dedicated benchmark to evaluate social bias in LLM-generated code across two real-world software engineering tasks: function implementation and unit-test generation. It defines three core fairness metrics and uses a bias-detection scorer plus a code-utility measure to assess outputs from 11 diverse LLMs. Experimental results show that all studied models exhibit social biases, with stronger biases in test-case generation and notable trade-offs between fairness and utility. The work highlights the need for broader attribute coverage and robust bias-mitigation techniques to promote fair and inclusive code-generation systems with real-world impact.

Abstract

Large language models (LLMs) have been widely deployed in coding tasks, drawing increasing attention to the evaluation of the quality and safety of LLMs' outputs. However, research on bias in code generation remains limited. Existing studies typically identify bias by applying malicious prompts or by reusing tasks and datasets originally designed for discriminative models. Given that prior datasets are not fully optimized for code-related tasks, there is a pressing need for benchmarks specifically designed for evaluating code models. In this study, we introduce FairCoder, a novel benchmark for evaluating social bias in code generation. FairCoder explores the bias issue by following the software development pipeline, from function implementation to unit testing, with diverse real-world scenarios. Additionally, three metrics are designed to assess fairness performance on this benchmark. We conduct experiments on widely used LLMs and provide a comprehensive analysis of the results. The findings reveal that all tested LLMs exhibit social bias.

Paper Structure (37 sections, 3 equations, 10 figures, 18 tables, 2 algorithms)

Introduction
Related Work
Methods
Function Implementation
Scenarios for Function Implementation
Test Case Generation
Topics for Test Case Generation
Metrics
Experiments
Experiment Setting
Overall Performance
Model Specific Observations
Preferred Groups in Different Topics
Potential Solution
Conclusion
...and 22 more sections

Figures (10)

Figure 1: Illustration of the proposed benchmark. The x-axis represents the pipeline of our framework, while the y-axis represents the pipeline of software development. For function implementation and unit testing, the input to the LLM consists of an unbiased function demo and a request that contains sensitive attributes. After the code is generated, the metrics are calculated from the LLM's output.
Figure 2: Code templates for function implementation.
Figure 3: Model preference on function implementation. The x-axis represents the attributes examined across the three scenarios, while the y-axis denotes the LLMs. The color of each dot indicates the group favored by the model, with larger dots signifying stronger preferences. A detailed version is provided in the Appendix.
Figure 4: Refusal rate in function implementation and test case generation.
Figure 5: Job hiring.
...and 5 more figures
