LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah; Mingji Han; Saurabh Pujar; Hammond Pearce; Ayse Coskun; Gianluca Stringhini

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, Gianluca Stringhini

TL;DR

This work introduces SecLLMHolmes, a fully automated evaluation framework to assess whether large language models can reliably identify and reason about security vulnerabilities. It deploys 228 code scenarios across eight CWEs in C and Python to test eight capable LLMs on eight dimensions, including determinism, reasoning faithfulness, and real-world performance. The experimental results reveal non-deterministic outputs, incorrect or unfaithful reasoning, and strong sensitivity to minor code augmentations, with even GPT-4 and PaLM2 failing in real-world cases. The authors release the framework and dataset to enable ongoing benchmarking, underscoring that current LLMs are not yet ready to serve as general-purpose security assistants and outlining concrete directions for improvement.

Abstract

Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like `PaLM2' and `GPT-4': by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general purpose security assistants.

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

TL;DR

Abstract

Paper Structure (28 sections, 4 equations, 11 figures, 27 tables)

This paper contains 28 sections, 4 equations, 11 figures, 27 tables.

Introduction
Background and Related Work
SecLLMHolmes
LLM configuration
LLM Parameters
Prompt Templates
Datasets
Ground-Truth Reasoning $G_r$
Evaluator
Experimental Investigation
Evaluation for Deterministic Responses
Performance Over Range of Parameters
Diversity of Prompts
Faithful Reasoning
Evaluation Over Variety of Vulnerabilities
...and 13 more sections

Figures (11)

Figure 1: LLM chat input format. LLMs operate on a three-part input format: (1) a system prompt, (2) few-shot examples presented as chat history to guide the model's learning, and (3) the specific user input/task to be processed.
Figure 2: SecLLMHolmes Architecture.
Figure 3: Non-deterministic and inconsistent responses by 'codechat-bison@001'.
Figure 4: 'GPT-3.5' responses to standard 'S1' and security-expert like multi-step reasoning 'R2' for CWE-476 $1_v$ code scenario.
Figure 5: 'chat-bison@001' (PaLM2) response for CWE-77 $3_v$ scenario (see Appendix Figure \ref{['fig:cwe-77-3']}) using prompt 'D3' shows unfaithfulness between provided reasoning and final answer.
...and 6 more figures

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

TL;DR

Abstract

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Authors

TL;DR

Abstract

Table of Contents

Figures (11)