From Effectiveness to Efficiency: Uncovering Linguistic Bias in Large Language Model-based Code Generation

Weipeng Jiang; Xuanqi Gao; Juan Zhai; Shiqing Ma; Xiaoyu Zhang; Ziyan Lei; Chao Shen

TL;DR

This work investigates linguistic bias in LLM-driven code generation by comparing English and Chinese task descriptions using a unified evaluation framework. It introduces a bilingual dataset of 52 Python questions, automated correctness testing, and efficiency profiling via input-size dependent time measurements across ten LLMs, including GPT-3.5-Turbo and GPT-4. The study finds that about 12% of tasks differ in correctness and around 39% differ in efficiency between languages, with bias influenced by temperature and prompting. The authors provide a publicly available dataset and framework to facilitate ongoing research and advocate for mitigation strategies to ensure fair access to code-generation tools across languages.

Abstract

Large Language Models (LLMs) have demonstrated promising capabilities for code generation. While existing benchmarks evaluate the correctness and efficiency of LLM-generated code, potential linguistic bias, where code quality varies with the natural language used to describe programming tasks, remains underexplored. In this paper, we investigate this linguistic bias through the lens of English and Chinese. To facilitate our investigation, we present a unified evaluation framework comprising a curated dataset of 52 Python programming questions with parallel bilingual task descriptions, automated correctness verification, and efficiency quantification tools based on runtime complexity estimation. Based on this framework, we conduct the first empirical study of linguistic bias in LLM-generated code on eight popular LCGMs, as well as GPT-3.5-Turbo and GPT-4. We observe that the code generated by these LCGMs differs in correctness on an average of 12% of the bilingual programming tasks, where 39% also exhibit differing efficiency. Our findings indicate that LLMs commonly exhibit linguistic bias in code generation.
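To make the idea of efficiency quantification through runtime complexity estimation concrete, the sketch below times a function on inputs of increasing size and fits a straight line to the log-log data, so that the slope approximates the exponent k in a runtime of roughly n^k. This is only an illustration under stated assumptions: the helper names `measure_runtimes` and `estimate_exponent`, the best-of-three timing, and the power-law fit are hypothetical choices and are not taken from the paper's framework.

```python
import math
import time
from typing import Callable, List, Sequence


def measure_runtimes(func: Callable, make_input: Callable[[int], object],
                     sizes: Sequence[int], repeats: int = 3) -> List[float]:
    """Return the best-of-`repeats` wall-clock time of `func` for each input size.

    Hypothetical helper: `make_input(n)` builds an input of size n.
    """
    timings = []
    for n in sizes:
        data = make_input(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        timings.append(best)
    return timings


def estimate_exponent(sizes: Sequence[int], timings: Sequence[float]) -> float:
    """Least-squares slope of log(time) against log(size).

    Assumes the runtime follows a power law t ~ c * n**k; the slope estimates k.
    Timings are clamped to a tiny positive value so that log() is defined.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(max(t, 1e-12)) for t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Under these assumptions, two solutions to the same task (for example, one generated from an English description and one from a Chinese description) could be compared by their estimated exponents. Real measurements are noisy, so such a comparison would need a tolerance rather than an exact equality check.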

Paper Structure (21 sections, 10 equations, 8 figures, 3 tables)

Introduction
Related Work
Evaluation Design
Dataset Collection
Correctness Verification
Performance Estimation
Measurement Metrics
Target LLMs
Results and Analysis
RQ1: Study on Correctness
RQ2: Study on Efficiency
RQ3: The Impact of Prompting
Conclusions
Appendix
Example of Our Test Cases
...and 6 more sections

Figures (8)

Figure 1: Examples of linguistic bias in LLMs for code generation. The code generated by GPT-3.5 for the same programming task in English and Chinese exhibits differences in correctness and efficiency.
Figure 2: Venn Diagram of Efficiency Advantage.
Figure 3: The distribution of bilingual advantages of open-source LCGMs.
Figure 4: Different Prompting Methods.
Figure 5: Impact of Prompting on Bilingual Code Generation.
...and 3 more figures
