Large Language Model for Vulnerability Detection: Emerging Results and Future Directions

Xin Zhou, Ting Zhang, David Lo

TL;DR

Addresses whether large language models can be used for vulnerability detection in code. Proposes in-context learning with GPT-3.5 and GPT-4 and a suite of prompts including CWE-based knowledge and sample retrieval. Finds GPT-3.5 can match prior state-of-the-art with properly designed prompts, and GPT-4 can surpass it by a substantial margin, albeit with higher query costs. Provides a replication package and outlines future directions for local specialized LLMs, robustness, long-tail CWE handling, and human-AI collaboration.

Abstract

Previous learning-based vulnerability detection methods relied on either medium-sized pre-trained models or smaller neural networks trained from scratch. Recent advances in Large Pre-Trained Language Models (LLMs) have shown remarkable few-shot learning capabilities across a variety of tasks. However, the effectiveness of LLMs in detecting software vulnerabilities remains largely unexplored. This paper aims to bridge this gap by exploring how LLMs perform with various prompts, focusing on two state-of-the-art LLMs: GPT-3.5 and GPT-4. Our experimental results show that GPT-3.5 achieves performance competitive with the prior state-of-the-art vulnerability detection approach, and that GPT-4 consistently outperforms it.
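
To make the prompt design summarized in the TL;DR more concrete, the following is a minimal sketch of how an in-context prompt might combine CWE-based knowledge with retrieved labeled samples. It is not the authors' implementation: the function names, the prompt wording, the `CWE_HINTS` table, and the use of identifier-overlap (Jaccard) similarity for retrieval are all assumptions made for illustration. The resulting string would then be sent to a model such as GPT-3.5 or GPT-4; that call is omitted here.

```python
import re

# Hypothetical CWE hint text for illustration; not the paper's prompt wording.
CWE_HINTS = {
    "CWE-120": "Buffer copy without checking the size of the input.",
}


def identifiers(code):
    """Return the set of identifier-like tokens in a code snippet."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def similarity(a, b):
    """Jaccard overlap of identifier sets (an assumed stand-in retrieval metric)."""
    ta, tb = identifiers(a), identifiers(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query, pool, k=2):
    """Pick the k labeled samples most similar to the query code."""
    ranked = sorted(pool, key=lambda ex: similarity(query, ex["code"]), reverse=True)
    return ranked[:k]


def build_prompt(query, pool, cwe_ids=(), k=2):
    """Assemble task description, CWE hints, retrieved samples, and the query."""
    parts = [
        "Decide whether the final function is vulnerable. "
        "Answer 'vulnerable' or 'not vulnerable'."
    ]
    for cwe in cwe_ids:
        parts.append(f"{cwe}: {CWE_HINTS[cwe]}")
    for ex in retrieve(query, pool, k):
        parts.append(f"Code:\n{ex['code']}\nAnswer: {ex['label']}")
    # The query goes last with an open answer slot for the model to complete.
    parts.append(f"Code:\n{query}\nAnswer:")
    return "\n\n".join(parts)
```

Calling `build_prompt(query, pool, cwe_ids=["CWE-120"], k=1)` with a small pool of labeled functions yields a single prompt string that ends with an open `Answer:` slot. A real setup would likely replace the identifier-overlap metric with a stronger code-similarity measure; the simple metric is used here only to keep the sketch self-contained.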

Paper Structure (7 sections, 3 tables)