FairCoder: Evaluating Social Bias of LLMs in Code Generation
Yongkang Du, Jen-tse Huang, Jieyu Zhao, Lu Lin
TL;DR
FairCoder introduces a dedicated benchmark to evaluate social bias in LLM-generated code across two real-world software engineering tasks: function implementation and unit-test generation. It defines three core fairness metrics and uses a bias-detection scorer plus a code-utility measure to assess outputs from 11 diverse LLMs. Experimental results show that all studied models exhibit social biases, with stronger biases in test-case generation and notable trade-offs between fairness and utility. The work highlights the need for broader attribute coverage and robust bias-mitigation techniques to promote fair and inclusive code-generation systems with real-world impact.
Abstract
Large language models (LLMs) have been widely deployed in coding tasks, drawing increasing attention to the evaluation of the quality and safety of LLMs' outputs. However, research on bias in code generation remains limited. Existing studies typically identify bias by applying malicious prompts or by reusing tasks and datasets originally designed for discriminative models. Given that prior datasets are not fully optimized for code-related tasks, there is a pressing need for benchmarks specifically designed for evaluating code models. In this study, we introduce FairCoder, a novel benchmark for evaluating social bias in code generation. FairCoder explores the bias issue by following the software development pipeline, from function implementation to unit testing, across diverse real-world scenarios. Additionally, three metrics are designed to assess fairness performance on this benchmark. We conduct experiments on widely used LLMs and provide a comprehensive analysis of the results. The findings reveal that all tested LLMs exhibit social bias.
