APPATCH: Automated Adaptive Prompting Large Language Models for Real-World Software Vulnerability Patching

Yu Nong, Haoran Yang, Long Cheng, Hongxin Hu, Haipeng Cai

TL;DR

APPATCH is an automated vulnerability patching framework that uses vulnerability semantics reasoning and adaptive prompting to guide LLMs toward valid patches without test inputs or model fine-tuning. It works in two phases, both underpinned by SDG-based static analysis: Phase 1 mines exemplars based on vulnerability semantics, and Phase 2 performs LLM-guided causal patching with dynamic exemplar selection and multi-faceted validation. Across 97 zero-day and 20 existing vulnerabilities, APPATCH achieves state-of-the-art performance, with up to 36.46% F1 on Zero-Day and 73.86% F1 on ExtractFix, and improves on the best baseline by up to 28.33% in F1 and 182.26% in recall. The framework emphasizes semantics-aware scoping, adaptive prompting, and ensemble validation. The authors acknowledge limitations related to dataset size, compilability, and potential data leakage, and outline how the approach could extend to other CWEs and languages.

Abstract

Timely and effective vulnerability patching is essential for cybersecurity defense, for which various approaches have been proposed yet still struggle to generate valid and correct patches for real-world vulnerabilities. In this paper, we leverage the power and merits of pre-trained large language models (LLMs) to enable automated vulnerability patching using no test input/exploit evidence and without model training/fine-tuning. To elicit LLMs to effectively reason about vulnerable code behaviors, which is essential for quality patch generation, we introduce vulnerability semantics reasoning and adaptive prompting on LLMs and instantiate the methodology as APPATCH, an automated LLM-based patching system. Our evaluation of APPATCH on 97 zero-day vulnerabilities and 20 existing vulnerabilities demonstrates its superior performance to both existing prompting methods and state-of-the-art non-LLM-based techniques (by up to 28.33% in F1 and 182.26% in recall over the best baseline). Through APPATCH, we demonstrate what helps LLM-based patching and how, and we discuss what is still lacking and why.

Paper Structure (34 sections, 4 equations, 11 figures, 13 tables, 3 algorithms)

Figures (11)

  • Figure 1: An example vulnerable program sample where the vulnerable statement is at line 48.
  • Figure 2: GPT-4's patch for the program sample in Figure 1 with standard prompting.
  • Figure 3: GPT-4's patch for the sample in Figure 1 with zero-shot completion, a state-of-the-art LLM-based approach.
  • Figure 4: An overview of Appatch's design, including its inputs, two main working phases, and outputs.
  • Figure 5: Root cause analysis generated by GPT-4 with vulnerability semantics reasoning for the sample in Figure 1.
  • ...and 6 more figures