How Natural Language Proficiency Shapes GenAI Code for Software Engineering Tasks

Ruksit Rojpaisarnkit, Youmei Fan, Kenichi Matsumoto, Raula Gaikovina Kula

TL;DR

This paper asks whether the English proficiency of a prompt, independent of prompting technique, shapes the quality of code produced by FM-powered SE tools. The study combines 164 HumanEval tasks, three LLMs (GPT-4o, Gemini 2.5 Pro, Claude Sonnet 4), and CEFR-based manipulation of prompt proficiency to quantify the effect on code outputs. The LLMs default to an intermediate ($B2$) natural language level, and code correctness is assessed via $pass@1$ on HumanEval and HumanEvalPlus. Higher-proficiency prompts consistently improve code correctness across models, though the effect on code proficiency is model-dependent. The findings position natural language proficiency as a practical lever for reliability, informing prompt design, accessibility, and tool choice in software engineering workflows.

Abstract

With the widespread adoption of Foundation Model (FM)-powered tools in software engineering, the natural language prompt has become a critical interface between developers and Large Language Models (LLMs). While much research has focused on prompt structure, natural language proficiency is an underexplored factor that can influence the quality of generated code. This paper investigates whether English language proficiency itself, independent of the prompting technique, affects the proficiency and correctness of code generated by LLMs. Using the HumanEval dataset, we systematically varied the English proficiency of prompts from basic to advanced for 164 programming tasks and measured the resulting code proficiency and correctness. Our findings show that LLMs default to an intermediate (B2) natural language level. While the effect on the resulting code proficiency was model-dependent, we found that higher-proficiency prompts consistently yielded more correct code across all models. These results demonstrate that natural language proficiency is a key lever for controlling code generation, helping developers tailor AI output and improve the reliability of solutions.
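
Correctness in this study is reported as $pass@1$ on HumanEval and HumanEvalPlus. As a rough illustration of how such a score can be computed, the sketch below uses the unbiased pass@k estimator commonly used with HumanEval, which for k = 1 reduces to the fraction of samples that pass all tests. The function names, the number of samples per task, and the example outcomes are assumptions made for illustration; the summary above does not state the authors' exact sampling setup.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated for the task, c: samples passing all tests,
    k: evaluation budget. For k = 1 this equals c / n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k=1):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes (not data from the paper): three tasks,
# one generated solution each, two of which pass their tests.
example = [(1, 1), (1, 0), (1, 1)]
score = mean_pass_at_k(example, k=1)
print(f"pass@1 = {score:.3f}")
```

Computing this score separately for each prompt proficiency level and each model would give the kind of per-level comparison the abstract describes.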

Paper Structure

This paper contains 8 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Proficiency level of the problem descriptions and proportion of code proficiency levels in the generated solutions when using LLMs (statistical results are shown in Table \ref{tab:cefr-stat} and Table \ref{tab:code-stat})