CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, Dawn Song

TL;DR

The paper defines a code hallucination as plausible-looking code that fails to execute correctly or to meet the task requirements, and argues that code generation should be assessed through execution-based verification. It introduces CodeHalu, a dynamic detection algorithm, and CodeHaluEval, a benchmark used to quantify code hallucinations across 17 mainstream LLMs. The taxonomy has four types (Mapping, Naming, Resource, Logic) divided into 8 subcategories. The evaluation shows significant model-to-model variation, especially in logic hallucinations, with overall hallucination rates in the 20–60% range. The work provides a structured framework, a benchmark, and actionable guidance for improving code generation reliability through execution-driven verification and targeted data/architecture improvements.

Abstract

Large Language Models (LLMs) have made significant progress in code generation, offering developers groundbreaking automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible, but may not execute as expected or fulfill specified requirements. This phenomenon of hallucinations in the code domain has not been systematically explored. To advance the community's understanding and research on this issue, we introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, with each category further divided into different subcategories to understand and address the unique challenges faced by LLMs in code generation with finer granularity. Additionally, we present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations. By evaluating 17 popular LLMs using this benchmark, we reveal significant differences in their accuracy and reliability in code generation, offering detailed insights for further improving the code generation capabilities of LLMs. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.

Paper Structure (13 sections, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: The definition and classification of code hallucinations, including 4 main categories and 8 subcategories.
  • Figure 2: Examples that differentiate between code errors and code hallucinations.
  • Figure 3: The diagram illustrates the intersections of the hallucination types observed for Gemma-7B on CodeHaluEval. The bar chart at the top shows the frequency of each intersection, while the bar chart on the left shows the frequency of each type of hallucination. The connecting lines represent the co-occurrence patterns between different hallucinations.
  • Figure 4: Collection of CodeHaluEval benchmark based on a verification-identification-construction process.
  • Figure 5: The performance of 17 LLMs on different types of hallucinations and the overall hallucination rate.

Theorems & Definitions (8)

  • Definition 1: Code Hallucinations
  • Definition 2: Code Errors
  • Remark 3: Code Hallucinations vs. Code Errors
  • Definition 4: Mapping Hallucinations
  • Definition 5: Naming Hallucinations
  • Definition 6: Resource Hallucinations
  • Definition 7: Logic Hallucinations
  • Remark 8: Discussion of Rationality
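
To make the idea of execution-based classification concrete, the following is a minimal Python sketch, not the authors' CodeHalu algorithm. It runs a candidate program against one test case in a fresh interpreter and assigns one of the four category names from the definitions above. The exception-to-category table, the `classify_run` helper, the use of a timeout as a resource signal, and the "unclassified" fallback are all illustrative assumptions rather than details taken from the paper.

```python
import subprocess
import sys

# ASSUMPTION: this exception-to-category table is illustrative only;
# it is not taken from the paper.
ASSUMED_CATEGORY_BY_EXCEPTION = {
    "TypeError": "mapping",
    "ValueError": "mapping",
    "IndexError": "mapping",
    "KeyError": "mapping",
    "NameError": "naming",
    "AttributeError": "naming",
    "ImportError": "naming",
    "ModuleNotFoundError": "naming",
    "RecursionError": "resource",
    "MemoryError": "resource",
    "OverflowError": "resource",
}


def classify_run(code: str, stdin_text: str, expected_stdout: str,
                 timeout_s: float = 5.0) -> str:
    """Run `code` in a fresh interpreter; return 'pass' or an assumed category."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # ASSUMPTION: a run that exceeds the time budget counts as "resource".
        return "resource"

    if proc.returncode != 0:
        # The last non-empty traceback line looks like "NameError: name ...".
        lines = [ln for ln in proc.stderr.strip().splitlines() if ln.strip()]
        exc_name = lines[-1].split(":", 1)[0].strip() if lines else ""
        return ASSUMED_CATEGORY_BY_EXCEPTION.get(exc_name, "unclassified")

    # The program ran to completion but produced the wrong answer.
    if proc.stdout.strip() != expected_stdout.strip():
        return "logic"
    return "pass"


if __name__ == "__main__":
    print(classify_run("print(int(input()) * 2)", "21\n", "42"))  # pass
    print(classify_run("print(undefined_name)", "", ""))          # naming
    print(classify_run("print(1 + 1)", "", "3"))                  # logic
```

Running each candidate in a separate interpreter process keeps a crash or infinite loop in the generated code from affecting the harness. A real evaluation would also need sandboxing, which this sketch omits.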