From LLMs to Agents: A Comparative Evaluation of LLMs and LLM-based Agents in Security Patch Detection

Junxiao Han, Zheng Yu, Lingfeng Bao, Jiakun Liu, Yao Wan, Jianwei Yin, Shuiguang Deng, Song Han

TL;DR

This study systematically evaluates LLMs and LLM-based agents for security patch detection using three methods—Plain LLM, Data-Aug LLM, and ReAct Agent—across commercial and open-source models on the PatchDB dataset. Data-Aug LLM offers the strongest overall performance, while the ReAct Agent excels at minimizing false positives through iterative reasoning. Commercial LLMs consistently outperform open-source alternatives, though baselines like PatchFinder can achieve high accuracy at the cost of higher FPR. The work provides practical guidance on prompting strategies and context window sizing, with implications for deploying automated patch detection in real-world software security workflows.

Abstract

The widespread adoption of open-source software (OSS) has accelerated software innovation but also increased security risks due to the rapid propagation of vulnerabilities and silent patch releases. In recent years, large language models (LLMs) and LLM-based agents have demonstrated remarkable capabilities in various software engineering (SE) tasks, enabling them to effectively address software security challenges such as vulnerability detection. However, systematic evaluation of the capabilities of LLMs and LLM-based agents in security patch detection remains limited. To bridge this gap, we conduct a comprehensive evaluation of the performance of LLMs and LLM-based agents for security patch detection. Specifically, we investigate three methods: Plain LLM (a single LLM with a system prompt), Data-Aug LLM (data augmentation based on the Plain LLM), and the ReAct Agent (leveraging the thought-action-observation mechanism). We also evaluate the performance of both commercial and open-source LLMs under these methods and compare these results with those of existing baselines. Furthermore, we analyze the detection performance of these methods across various vulnerability types, and examine the impact of different prompting strategies and context window sizes on the results. Our findings reveal that the Data-Aug LLM achieves the best overall performance, whereas the ReAct Agent demonstrates the lowest false positive rate (FPR). Although baseline methods exhibit strong accuracy, their false positive rates are significantly higher. In contrast, our evaluated methods achieve comparable accuracy while substantially reducing the FPR. These findings provide valuable insights into the practical applications of LLMs and LLM-based agents in security patch detection, highlighting their advantage in maintaining robust performance while minimizing false positive rates.
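
The two headline metrics in the abstract, accuracy and false positive rate (FPR), can be made concrete with a minimal sketch. It assumes binary labels where 1 marks a security patch and 0 a non-security patch; the names `Confusion`, `tally`, `accuracy`, and `false_positive_rate` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary classifier (1 = security patch, 0 = non-security)."""
    tp: int = 0
    fp: int = 0
    tn: int = 0
    fn: int = 0


def tally(labels, predictions):
    """Build confusion counts from parallel lists of 0/1 labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    c = Confusion()
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            c.tp += 1
        elif y == 0 and p == 1:
            c.fp += 1  # non-security patch flagged as security: a false alarm
        elif y == 0 and p == 0:
            c.tn += 1
        else:
            c.fn += 1  # security patch that was missed
    return c


def accuracy(c):
    """Fraction of all patches classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def false_positive_rate(c):
    """Fraction of non-security patches wrongly flagged: FP / (FP + TN)."""
    negatives = c.fp + c.tn
    return c.fp / negatives if negatives else 0.0


if __name__ == "__main__":
    # Made-up toy data, not results from the paper.
    labels = [1, 1, 0, 0, 0, 1]
    predictions = [1, 0, 1, 0, 0, 1]
    counts = tally(labels, predictions)
    print(counts, accuracy(counts), false_positive_rate(counts))
```

Because FPR is computed only over non-security patches, a detector can score high accuracy and still raise many false alarms when the two classes are imbalanced, which is the trade-off the abstract reports for the baselines.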

Paper Structure

This paper contains 26 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: The parts of the security patches that fix CVE-2025-48384 (a) and CVE-2024-11680 (b). The red and green lines show the code before the fix (pre-patch) and after the fix (post-patch), respectively.
  • Figure 2: Overview of our approach.
  • Figure 3: The prompt template used with Plain LLM.
  • Figure 4: The prompt template used with Data-Aug LLM.
  • Figure 5: The prompt template used with ReAct Agent.
  • ...and 9 more figures