HFuzzer: Testing Large Language Models for Package Hallucinations via Phrase-based Fuzzing

Yukai Zhao, Menghan Wu, Xing Hu, Xin Xia

TL;DR

This work targets the security risk of package hallucinations in LLM-assisted coding by introducing HFuzzer, a phrase-based fuzzing framework that generates diverse, code-related tasks through phrase compositions ⟨Object, Predicate, Complement⟩. It combines a seed-driven fuzzing loop, two-stage hallucination triggering, and a quantitative HS-based evaluation to identify when LLMs propose non-existent or cross-language packages. Across nine tester-target model pairings, HFuzzer triggers package hallucinations more effectively than GPTFuzzer-A, finding on average 2.60× more unique hallucinated packages and delivering higher task diversity; GPT-4o alone yielded 46 unique hallucinations. The study also analyzes case-study results, discusses data contamination and parameter effects, and suggests mitigation directions such as RAG, self-feedback, and fine-tuning, highlighting practical implications for reducing package-hallucination risks in real-world development workflows.

Abstract

Large Language Models (LLMs) are widely used for code generation, but they pose critical security risks in production because of package hallucinations, in which an LLM recommends packages that do not exist. Attackers can exploit these hallucinations in software supply chain attacks by registering malicious packages under the hallucinated names. Testing LLMs for package hallucinations is therefore critical for mitigating them and defending against such attacks. Although researchers have proposed testing frameworks for fact-conflicting hallucinations in natural language generation, package hallucinations remain understudied. To fill this gap, we propose HFuzzer, a novel phrase-based fuzzing framework for testing LLMs for package hallucinations. HFuzzer applies fuzzing and guides the model to infer a wider range of reasonable information from phrases, thereby generating sufficient and diverse coding tasks. Furthermore, HFuzzer extracts phrases from package information or coding tasks so that the phrases stay relevant to code, which in turn improves the relevance of the generated tasks and code. We evaluate HFuzzer on multiple LLMs and find that it triggers package hallucinations in all selected models. Compared with a mutational fuzzing framework, HFuzzer identifies 2.60× more unique hallucinated packages and generates more diverse tasks. When testing GPT-4o, HFuzzer finds 46 unique hallucinated packages. Further analysis of GPT-4o reveals that LLMs exhibit package hallucinations not only during code generation but also when assisting with environment configuration.
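
To make the notion of a package hallucination concrete, the following is a minimal sketch of how a model response could be checked for non-existent packages. It is not HFuzzer's implementation: the function names, the `pip install` extraction rule, and the tiny `KNOWN_PACKAGES` set (a stand-in for a real registry snapshot) are all assumptions made for illustration.

```python
from __future__ import annotations

import re

# Hypothetical stand-in for a registry snapshot; a real check would compare
# against the package index of the target language.
KNOWN_PACKAGES = {"requests", "numpy", "pandas", "flask"}

# Captures everything after "pip install" up to the end of the line.
INSTALL_RE = re.compile(r"pip\s+install\s+([^\n]+)")


def normalize(name: str) -> str:
    """Lowercase and unify separators (PEP 503 style name normalization)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def extract_installed_packages(response: str) -> set[str]:
    """Collect package names from `pip install ...` lines in a response."""
    names: set[str] = set()
    for match in INSTALL_RE.finditer(response):
        for token in match.group(1).split():
            if token.startswith("-"):  # skip flags such as -U or --upgrade
                continue
            # Drop version specifiers and extras, e.g. ==1.2.3, >=2.0, [extra]
            name = re.split(r"[=<>!~\[]", token, maxsplit=1)[0]
            if name:
                names.add(normalize(name))
    return names


def find_hallucinated(response: str, known: set[str]) -> set[str]:
    """Return recommended packages that are absent from the known set."""
    known_norm = {normalize(k) for k in known}
    return extract_installed_packages(response) - known_norm


if __name__ == "__main__":
    reply = "First run:\npip install requests fastjsonx==1.0\nthen pip install -U numpy"
    print(find_hallucinated(reply, KNOWN_PACKAGES))
```

In this sketch, any name recommended in an install command but missing from the known set is flagged. A realistic checker would also need to handle import statements and the cross-language case mentioned in the TL;DR, where a package exists but only in another language's ecosystem.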

Paper Structure

This paper contains 29 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: An Example of Package Hallucination
  • Figure 2: The overview of HFuzzer. Phrase Extraction is discussed in Section \ref{sec:seed_pool}, Seed Selection in Section \ref{sec:selection}, Task Generation in Section \ref{sec:generation}, Hallucination Triggering in Section \ref{sec:query}, Hallucination Evaluation in Section \ref{sec:check}, Power Adjustment in Section \ref{sec:modify}, and Seed Pool Expansion in Section \ref{sec:extract}.
  • Figure 3: RQ1: Average PHR of Different Target Models
  • Figure 4: RQ1: Heatmap of PHR with Different Models
  • Figure 5: RQ2: Average diversity index of tasks generated under different DBSCAN parameter settings ($\varepsilon \in \{0.1, 0.2, 0.3\}$, minS $\in \{1, 3, 5\}$).
  • ...and 3 more figures