From Effectiveness to Efficiency: Uncovering Linguistic Bias in Large Language Model-based Code Generation
Weipeng Jiang, Xuanqi Gao, Juan Zhai, Shiqing Ma, Xiaoyu Zhang, Ziyan Lei, Chao Shen
TL;DR
This work investigates linguistic bias in LLM-driven code generation by comparing English and Chinese task descriptions within a unified evaluation framework. It introduces a bilingual dataset of 52 Python questions, automated correctness testing, and efficiency profiling based on input-size-dependent time measurements, and applies them to ten LLMs, including GPT-3.5-Turbo and GPT-4. The study finds that about 12% of tasks differ in correctness and around 39% differ in efficiency between the two languages, with the bias influenced by temperature and prompting. The authors release the dataset and framework publicly to support further research and advocate mitigation strategies to ensure fair access to code-generation tools across languages.
Abstract
Large Language Models (LLMs) have demonstrated promising capabilities for code generation. While existing benchmarks evaluate the correctness and efficiency of LLM-generated code, potential linguistic bias, where code quality varies with the natural language used to describe a programming task, remains underexplored. In this paper, we investigate this linguistic bias through the lens of English and Chinese. To facilitate our investigation, we present a unified evaluation framework comprising a curated dataset of 52 Python programming questions with parallel bilingual task descriptions, automated correctness verification, and efficiency quantification tools based on runtime complexity estimation. Based on this framework, we conduct the first empirical study of linguistic bias in LLM-generated code on eight popular LLM-based code generation models (LCGMs), as well as GPT-3.5-Turbo and GPT-4. We observe that the code generated by these LCGMs differs in correctness on an average of 12% of the bilingual programming tasks, where 39% also exhibit differences in efficiency. Our findings indicate that LLMs commonly exhibit linguistic bias in code generation.
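To make the idea of efficiency quantification via runtime complexity estimation concrete, the following is a minimal sketch, not the authors' tooling. It assumes one plausible approach: time a generated solution at several input sizes and fit the slope of log(time) against log(size), so that a slope near 1 suggests roughly linear growth and a slope near 2 suggests roughly quadratic growth. The helper names `measure_runtimes` and `fit_growth_exponent` are hypothetical and introduced only for illustration.

```python
import math
import time
from typing import Any, Callable, List, Sequence


def measure_runtimes(
    func: Callable[[Any], Any],
    make_input: Callable[[int], Any],
    sizes: Sequence[int],
    repeats: int = 3,
) -> List[float]:
    """Time func on inputs of each size; keep the best of several repeats."""
    timings = []
    for n in sizes:
        data = make_input(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        timings.append(best)
    return timings


def fit_growth_exponent(sizes: Sequence[int], timings: Sequence[float]) -> float:
    """Least-squares slope of log(time) versus log(size).

    Assumption of this sketch: runtime behaves like c * n**k, so the
    slope approximates k. Timings are clamped to avoid log(0).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(max(t, 1e-9)) for t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Under these assumptions, the solutions generated from the English and Chinese descriptions of the same task could each be passed through `measure_runtimes` and `fit_growth_exponent`, and a noticeable gap between the two fitted exponents would flag a candidate efficiency difference. Wall-clock timings are noisy, so the input sizes and any threshold on that gap would need tuning in practice.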
