A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality
M. Mehdi Kholoosi, M. Ali Babar, Roland Croft
TL;DR
The paper investigates whether ChatGPT can meaningfully assist in software security. It examines public perceptions shared on Twitter and evaluates the model's practicality as an oracle for vulnerability detection, using a carefully curated dataset collected after a known model training cut-off. The study combines a qualitative analysis of tweets with a hands-on evaluation of GPT-4 outputs on 70 real vulnerabilities drawn from the NVD, spanning 40 CWE types and 12 programming languages, with a fixed prompt used to elicit security analyses. The authors find that practitioners are generally optimistic and see potential in tasks such as vulnerability detection and information retrieval, but ChatGPT often produces generic, context-poor output and achieves only partial accuracy in vulnerability detection. These results point to the need for domain-specific LLMs and more rigorous industrial validation. By contrasting perceived value with practical limitations, the study offers guidance for future research on specialized security LLMs and on prompting strategies that improve reliability in security-critical applications.
Abstract
Artificial Intelligence (AI) advancements have enabled the development of Large Language Models (LLMs) that can perform a variety of tasks with remarkable semantic understanding and accuracy. ChatGPT is one such LLM that has gained significant attention due to its impressive capabilities for assisting in various knowledge-intensive tasks. Due to the knowledge-intensive nature of engineering secure software, ChatGPT's assistance is expected to be explored for security-related tasks during the development/evolution of software. To gain an understanding of the potential of ChatGPT as an emerging technology for supporting software security, we adopted a two-fold approach. First, we performed an empirical study to analyse the perceptions of those who had explored the use of ChatGPT for security tasks and shared their views on Twitter. We found that security practitioners view ChatGPT as beneficial for various software security tasks, including vulnerability detection, information retrieval, and penetration testing. Second, we designed an experiment aimed at investigating the practicality of this technology when deployed as an oracle in real-world settings. In particular, we focused on vulnerability detection and qualitatively examined ChatGPT outputs for given prompts within this prominent software security task. Based on our analysis, responses from ChatGPT in this task are largely filled with generic security information and may not be appropriate for industry use. To prevent data leakage, we performed this analysis on a vulnerability dataset compiled after the OpenAI data cut-off date from real-world projects covering 40 distinct vulnerability types and 12 programming languages. We assert that the findings from this study will contribute to future research aimed at developing and evaluating LLMs dedicated to software security.
