Uncovering Weaknesses in Neural Code Generation

Xiaoli Lian; Shuaisong Wang; Jieping Ma; Fang Liu; Xin Tan; Li Zhang; Lin Shi; Cuiyun Gao

TL;DR

This paper addresses the lack of a comprehensive taxonomy of weaknesses in neural code generation by systematically evaluating five state-of-the-art PLMs on three Python datasets, using both match-based and execution-based metrics. Through thematic analysis of problematic outputs, it introduces a nine-type weakness taxonomy that covers benchmark design, prompt interpretation, and generated-code quality, and it reveals pervasive issues such as missing pivotal semantics and inaccurate API usage. The study also shows that prompt curation can measurably improve code quality, underscoring the importance of prompt design and targeted evaluation. By reporting detailed weakness distributions across models and benchmarks, the work offers practical guidance for researchers to prioritize weaknesses, refine prompts, and expand benchmarks so that they better assess real-world coding capabilities.

Abstract

Code generation, the task of producing source code from prompts, has seen significant advancements with the advent of pre-trained large language models (PLMs). Despite these achievements, a comprehensive taxonomy of weaknesses in the benchmarks and the generated code is still lacking, so the community risks focusing on known issues at the cost of under-explored areas. Our systematic study aims to fill this gap by evaluating five state-of-the-art PLMs: three larger models (CodeGen2.5 with 7 billion parameters, CodeGeeX2 with 6 billion parameters, and GPT-4 Turbo) and two smaller ones (UnixCoder with 110 million parameters and CodeT5 base with 220 million parameters), across three popular datasets: CoNaLa, HumanEval Plus, and DS-1000. We assess the quality of the generated code using match-based and execution-based metrics, then conduct a thematic analysis to develop a taxonomy of nine types of weaknesses. We examine the weakness distributions in both the larger and the smaller models, combining model-specific analysis with collective analysis (union and intersection) across models. Our research uncovers three salient findings: (1) In the CoNaLa dataset, inaccurate prompts are a notable problem, causing all large models to fail in 26.84% of cases, with an even higher failure rate of 40% for the smaller models; (2) Missing pivotal semantics is a pervasive issue across benchmarks, with one or more large models omitting key semantics in 65.78% of CoNaLa tasks, and similarly high occurrences in HumanEval Plus (66.09%) and DS-1000 (80.51%); (3) All models struggle with proper API usage, a challenge amplified by vague or complex prompts. Our findings aim to steer researchers towards addressing specific weaknesses and challenges in code generation. Furthermore, our annotations can offer a targeted benchmark subset for detailed analysis.
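The collective analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each model's annotations are available as a set of task IDs on which that model shows a given weakness; the model names, task IDs, benchmark size, and the helper names `union_rate` and `intersection_rate` are all hypothetical and are not taken from the paper. The union rate counts tasks where one or more models show the weakness, while the intersection rate counts tasks where every model does.

```python
# Hypothetical annotations: model name -> set of task IDs on which the model
# exhibits a given weakness. These values are made up for illustration only.
failures = {
    "model_a": {1, 2, 3, 5},
    "model_b": {2, 3, 6},
    "model_c": {2, 3, 5, 7},
}
total_tasks = 10  # assumed benchmark size


def union_rate(task_sets, n_tasks):
    """Fraction of tasks on which at least one model shows the weakness."""
    task_sets = list(task_sets)
    return len(set().union(*task_sets)) / n_tasks


def intersection_rate(task_sets, n_tasks):
    """Fraction of tasks on which every model shows the weakness."""
    task_sets = list(task_sets)
    if not task_sets:
        return 0.0
    return len(set.intersection(*task_sets)) / n_tasks


if __name__ == "__main__":
    print("union:", union_rate(failures.values(), total_tasks))
    print("intersection:", intersection_rate(failures.values(), total_tasks))
```

Under this reading, a statement such as "one or more large models omit key semantics" corresponds to a union rate, and "all large models fail" corresponds to an intersection rate.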