Pros and Cons! Evaluating ChatGPT on Software Vulnerability
Xin Yin
TL;DR
The paper presents a zero-shot, multitask evaluation of ChatGPT on software vulnerability (SV) tasks using the Big-Vul dataset, benchmarking it against state-of-the-art (SOTA) methods on detection, assessment, localization, repair, and description. It finds that SOTA approaches generally outperform ChatGPT, although supplying additional context improves some assessment and description results, and ChatGPT shows some localization ability that varies by CWE type. The authors propose an evaluation framework and release reproducibility resources to guide future improvements in LLM-based vulnerability handling, highlighting the need for better vulnerability understanding and description capabilities. The work offers practical insights for refining prompts, context provision, and model alignment to improve SV-related reasoning in LLMs.
Abstract
This paper proposes a pipeline for quantitatively evaluating interactive LLMs such as ChatGPT using a publicly available dataset. We carry out an extensive technical evaluation of ChatGPT on Big-Vul, covering five common software vulnerability tasks, and evaluate the multitask and multilingual aspects of ChatGPT on this dataset. We find that existing state-of-the-art methods are generally superior to ChatGPT in software vulnerability detection. Although ChatGPT's accuracy improves when it is given context information, it still has limitations in accurately predicting severity ratings for certain CWE types. ChatGPT demonstrates some ability to locate vulnerabilities, but its performance varies across CWE types. It exhibits limited vulnerability repair capabilities whether or not context information is provided. Finally, ChatGPT shows uneven performance in generating CVE descriptions across CWE types, with limited accuracy on detailed information. Overall, although ChatGPT performs well in some aspects, it still needs to improve its understanding of the subtle differences between code vulnerabilities and its ability to describe vulnerabilities before it can realize its full potential. Our evaluation framework provides valuable insights for further enhancing ChatGPT's software vulnerability handling capabilities.
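To make the idea of a quantitative, zero-shot evaluation concrete, the following is a minimal sketch of how the detection task alone might be scored. It is not the authors' pipeline: the prompt wording, the `parse_verdict` rule, the `ask_model` callable, and the choice of accuracy, precision, recall, and F1 as metrics are all assumptions made for illustration, and no real model API is called.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical zero-shot prompt; the wording actually used in the paper is not given here.
PROMPT_TEMPLATE = (
    "Is the following function vulnerable? Answer Yes or No.\n\n{code}"
)


def parse_verdict(response: str) -> int:
    """Map a free-text reply to 1 (vulnerable) or 0 (not vulnerable).

    Assumption: a reply that starts with "yes" counts as a positive prediction.
    """
    return 1 if response.strip().lower().startswith("yes") else 0


def detection_metrics(labels: List[int], preds: List[int]) -> Dict[str, float]:
    """Compute binary classification metrics from gold labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def evaluate_detection(
    samples: List[Tuple[str, int]],
    ask_model: Callable[[str], str],
) -> Dict[str, float]:
    """Query the model once per (code, label) sample and score the verdicts.

    `ask_model` is a placeholder for whatever client sends a prompt to the LLM.
    """
    labels = [label for _, label in samples]
    preds = [
        parse_verdict(ask_model(PROMPT_TEMPLATE.format(code=code)))
        for code, _ in samples
    ]
    return detection_metrics(labels, preds)
```

Passing the model as a callable keeps the scoring logic independent of any particular LLM client, so the same function can be exercised with a stub during testing and with a real client during evaluation.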
