Code Hallucination

Mirza Masfiqur Rahman, Ashish Kundu

TL;DR

The paper investigates code hallucination in large-language-model–driven code generation, formalizing the phenomenon and showing that it is pervasive across black-box models. It introduces HallTrigger, a semi-automated prompting framework that pairs program analysis with meta-prompts and reward signals to induce and study hallucinations without access to model internals. Through multiple case studies across three black-box LLMs, the work demonstrates a range of hallucination types in whole-code generation, from inflated algorithms to runtime errors, and analyzes model behavior when humans provide code for analysis. The findings highlight serious implications for software development and emphasize the need for robust evaluation benchmarks and remediation strategies, including static and dynamic analysis, to mitigate code hallucination. The work also outlines future directions: automating trigger prompts and developing more principled remediation techniques to improve the reliability of code-generating LLMs.

Abstract

Generative models such as large language models are extensively used as code copilots and for whole-program generation. However, the programs they generate often have questionable correctness, authenticity, and reliability in terms of integration: they might not follow the user requirements, might provide incorrect or nonsensical outputs, or might even contain semantic or syntactic errors. These failures are collectively known as LLM hallucination. In this work, we present several types of code hallucination. We have generated such hallucinated code manually using large language models. We also present a technique, HallTrigger, to demonstrate efficient ways of generating arbitrary code hallucination. Our method leverages three different dynamic attributes of LLMs to craft prompts that can successfully trigger hallucinations from models without the need to access model architecture or parameters. Results from popular black-box models suggest that HallTrigger is indeed effective and that pervasive LLM hallucination has a significant impact on software development.

Paper Structure (10 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: ChatGPT-generated memorized solution fails to follow simple output requirements and runs into a compilation error.
  • Figure 2: (Case 1) Incorrect algorithm suggestion for prompts asking for unachievable computational complexity.
  • Figure 3: (Case 2) Incorrect algorithm suggestion by ChatGPT for prompts asking for loose computational complexity.
  • Figure 4: (Case 2) Incorrect algorithm suggestion by Gemini, with an apparently correct test case, for prompts asking for loose computational complexity.
  • Figure 5: (Case 8) Repetitive line-count mistake by Gemini for a simple Python program.
  • ...and 2 more figures