WildCode: An Empirical Analysis of Code Generated by ChatGPT
Kobra Khanmohammadi, Pooria Roy, Raphael Khoury, Abdelwahab Hamou-Lhadj, Wilfried Patrick Konan
TL;DR
This paper presents the first large-scale empirical study of code generated by ChatGPT using real-world interactions from the WildChat dataset. It combines syntax validation, OpenGrep-based security analysis, and user-intent classification to characterize both the quality of the code and how users engage with it. The findings reveal pervasive security vulnerabilities, notable memory-related and deserialization issues, frequent regular-expression denial-of-service (ReDoS) risks, and a substantial number of hallucinated modules, alongside a clear lack of user attention to secure coding. The work underscores the need for security-aware LLMs, better prompting strategies, and proactive safeguards in AI-assisted programming, and it provides a reproducible dataset pipeline for future research.
Abstract
Large language models (LLMs) are increasingly used to generate code, but the quality and security of this code are often uncertain. Several recent studies have raised alarm bells, indicating that such AI-generated code may be particularly vulnerable to cyberattacks. However, most of these studies rely on code generated specifically for the study, which raises questions about the realism of such experiments. In this study, we perform a large-scale empirical analysis of real-life code generated by ChatGPT. We evaluate this code with respect to both correctness and security, and we examine the intentions of users who request code from the model. Our results corroborate previous studies based on synthetic queries, which found that LLM-generated code is often inadequate with respect to security. We also find that users exhibit little curiosity about the security properties of the code they ask LLMs to generate, as evidenced by how rarely they ask about this topic.
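As a rough illustration of two of the checks named in the TL;DR (syntax validation and detection of possibly hallucinated modules), the following minimal Python sketch parses a snippet and lists top-level imports that cannot be resolved in the local environment. It is not the paper's pipeline: the function name `check_snippet` and the returned dictionary layout are hypothetical, and "not installed locally" is only a weak proxy for "hallucinated", since a real package may simply be missing from the machine running the check.

```python
import ast
import importlib.util


def check_snippet(source: str) -> dict:
    """Hypothetical helper: report syntax validity and unresolved top-level imports."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # The snippet does not parse, so no imports can be extracted.
        return {"valid_syntax": False, "error": exc.msg, "unresolved_imports": []}

    # Collect the top-level package name of every absolute import.
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])

    unresolved = []
    for name in sorted(modules):
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Treat lookup failures as unresolved rather than crashing.
            found = False
        if not found:
            unresolved.append(name)

    return {"valid_syntax": True, "error": None, "unresolved_imports": unresolved}
```

For example, `check_snippet("import os\nimport surely_not_a_real_module_xyz\n")` reports valid syntax and flags only the second import. A more faithful check would presumably compare names against a package index rather than the local environment.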
