Exploring ChatGPT's Capabilities on Vulnerability Management

Peiyu Liu, Junming Liu, Lirong Fu, Kangjie Lu, Yifan Xia, Xuhong Zhang, Wenzhi Chen, Haiqin Weng, Shouling Ji, Wenhai Wang

TL;DR

The paper asks whether ChatGPT can support the full vulnerability management lifecycle, including bug report handling, severity assessment, and patch validation. It performs a large-scale evaluation across 6 tasks with 70,346 samples, comparing ChatGPT against 11 SOTA baselines and exploring diverse prompt designs, including a self-heuristic prompt that extracts knowledge from demonstrations. Key findings show that ChatGPT can outperform SOTA approaches on some tasks (notably bug report summarization), and that prompt design and model choice significantly affect performance. Self-heuristic prompts offer substantial gains on challenging tasks but risk ChatGPT misinterpreting information in the prompt. The work highlights practical pathways for integrating ChatGPT into vulnerability management while identifying bottlenecks such as hallucinations and limited context length, guiding future research on prompt strategies and knowledge extraction.

Abstract

Recently, ChatGPT has attracted great attention in the code analysis domain. Prior works show that ChatGPT is capable of handling foundational code analysis tasks, such as abstract syntax tree generation, which indicates its potential to comprehend code syntax and static behaviors. However, it is unclear whether ChatGPT can complete more complicated real-world vulnerability management tasks, such as predicting security relevance and patch correctness, which require an all-encompassing understanding of various aspects, including code syntax, program semantics, and related manual comments. In this paper, we explore ChatGPT's capabilities on 6 tasks spanning the complete vulnerability management process, using a large-scale dataset containing 70,346 samples. For each task, we compare ChatGPT against SOTA approaches, investigate the impact of different prompts, and explore the difficulties ChatGPT faces. The results suggest promising potential in leveraging ChatGPT to assist vulnerability management. One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports. Furthermore, our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions. For instance, directly providing random demonstration examples in the prompt does not consistently guarantee good performance in vulnerability management. By contrast, leveraging ChatGPT in a self-heuristic way (having ChatGPT itself extract expertise from demonstration examples and integrating the extracted expertise into the prompt) is a promising research direction. In addition, ChatGPT may misunderstand and misuse the information in the prompt. Consequently, effectively guiding ChatGPT to focus on helpful information rather than irrelevant content is still an open problem.
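The self-heuristic strategy described in the abstract can be sketched as a two-stage prompting loop: first ask the model to distill expertise from labeled demonstrations, then answer the query with that distilled expertise in the prompt instead of the raw demonstrations. This is a minimal sketch under stated assumptions, not the authors' implementation: the `llm` callable is a hypothetical stand-in for any chat-model client, and the prompt wording, the function names, and the (input, label) demonstration format are illustrative choices, not taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that sends a prompt to a chat model
# and returns the text of its reply. Not a real library API.
LLM = Callable[[str], str]
Demo = Tuple[str, str]  # (input sample, label); illustrative format


def build_extraction_prompt(demos: List[Demo]) -> str:
    """Stage 1 prompt: ask the model to distill expertise from demonstrations."""
    parts = ["Here are labeled examples for a vulnerability management task."]
    for i, (sample, label) in enumerate(demos, start=1):
        parts.append(f"Example {i}:\nInput: {sample}\nLabel: {label}")
    parts.append("Summarize the general knowledge that explains these labels.")
    return "\n\n".join(parts)


def build_task_prompt(task: str, expertise: str, query: str) -> str:
    """Stage 2 prompt: include the distilled expertise instead of raw demos."""
    return f"{task}\n\nRelevant expertise:\n{expertise}\n\nInput: {query}\nAnswer:"


def self_heuristic_predict(llm: LLM, task: str, demos: List[Demo], query: str) -> str:
    # First model call: extract expertise from the demonstration examples.
    expertise = llm(build_extraction_prompt(demos)).strip()
    # Second model call: answer the query with the extracted expertise in context.
    return llm(build_task_prompt(task, expertise, query)).strip()


if __name__ == "__main__":
    # Offline stub standing in for a real model (hypothetical behavior).
    def stub(prompt: str) -> str:
        if prompt.startswith("Here are labeled examples"):
            return "Reports describing memory corruption are usually security-relevant."
        return "yes"

    demos = [("heap overflow in image parser", "yes"), ("typo in README", "no")]
    print(self_heuristic_predict(stub, "Is this bug report security-relevant?",
                                 demos, "use-after-free in video decoder"))
```

Because the model is injected as a plain function, the pipeline can be exercised offline with a stub before wiring in a real chat API. The same structure would also allow the extracted expertise to be cached and reused across queries of one task rather than regenerated per query.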

Paper Structure

This paper contains 20 sections, 6 figures, and 14 tables.

Figures (6)

  • Figure 1: The vulnerability management process.
  • Figure 2: Evaluation pipeline.
  • Figure 3: An example of the expertise prompt. After removing the bold pink text, the rest represents the general-info prompt.
  • Figure 4: An example of the knowledge summarized by ChatGPT.
  • Figure 5: Results of user study (1 - Poor, 2 - Marginal, 3 - Acceptable, 4 - Good, 5 - Excellent).
  • ...and 1 more figure
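Figure 3's caption implies that the general-info prompt is simply the expertise prompt with the highlighted expertise text removed, so both variants can be generated from one template. The sketch below assumes a hypothetical split of the prompt into role, task, expertise, and output-format segments. The segment names, their order, and where the expertise text sits are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptParts:
    """Hypothetical split of a task prompt into named segments."""
    role: str                        # who the model is asked to act as
    task: str                        # what the model must decide or produce
    output_format: str               # how the answer should be formatted
    expertise: Optional[str] = None  # domain knowledge (the highlighted text)


def render(parts: PromptParts, include_expertise: bool) -> str:
    """Join segments line by line; expertise is the only optional segment."""
    segments = [parts.role, parts.task]
    if include_expertise and parts.expertise:
        segments.append(parts.expertise)
    segments.append(parts.output_format)
    return "\n".join(segments)


def expertise_prompt(parts: PromptParts) -> str:
    return render(parts, include_expertise=True)


def general_info_prompt(parts: PromptParts) -> str:
    # Same prompt with the expertise text removed, mirroring Figure 3's caption.
    return render(parts, include_expertise=False)
```

Generating both variants from the same parts keeps everything except the expertise text identical, so comparing results between the two prompts isolates the effect of the added expertise.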