Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models
Marc Bruni, Fabio Gabrielli, Mohammad Ghafari, Martin Kropp
TL;DR
The paper addresses the risk of vulnerabilities in LLM-generated code and investigates whether prompt engineering can mitigate security issues without changing the underlying models. It introduces an automated benchmark that uses the LLMSecEval and SecurityEval prompt datasets and static analyzers (Semgrep, CodeQL) to evaluate Python code generated by GPT-3.5-turbo, GPT-4o-mini, and GPT-4o. The key findings are that a simple security-focused prompt prefix can reduce vulnerabilities by up to 56% for GPT-4o and GPT-4o-mini, and that the Recursive Criticism and Improvement (RCI) technique can repair a substantial portion of flaws across all tested models, with especially large gains on the more capable models. The authors also propose a "prompt agent" that integrates these techniques into real development workflows, and they release open-source tooling to enable broader adoption and benchmarking.
Abstract
Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs). However, its effectiveness in mitigating vulnerabilities in LLM-generated code remains underexplored. To address this gap, we implemented a benchmark to automatically assess the impact of various prompt engineering strategies on code security. Our benchmark leverages two peer-reviewed prompt datasets and employs static scanners to evaluate code security at scale. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. Our results show that for GPT-4o and GPT-4o-mini, a security-focused prompt prefix can reduce the occurrence of security vulnerabilities by up to 56%. Additionally, all tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code when using iterative prompting techniques. Finally, we introduce a "prompt agent" that demonstrates how the most effective techniques can be applied in real-world development workflows.
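To make the two techniques concrete, the following is a minimal sketch of a security-focused prompt prefix and an iterative critique-and-improve loop in the spirit of RCI. It is an illustration under stated assumptions, not the authors' implementation: the prefix wording, the critique and improvement prompts, the function names, and the `llm` callable (a stand-in for any chat-completion call) are all hypothetical.

```python
from typing import Callable

# Hypothetical wording: the exact prefix used in the paper is not given here.
SECURITY_PREFIX = (
    "You are a developer who is very security-aware. "
    "Generate secure Python code for the following task:\n"
)


def with_security_prefix(task: str) -> str:
    """Prepend the security-focused prefix to a plain task prompt."""
    return SECURITY_PREFIX + task


def rci(task: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    """Iterative critique-and-improve loop (RCI-style sketch).

    `llm` is an assumed stand-in that maps a prompt string to a model
    response string; plug in any real client behind that interface.
    """
    # Step 1: generate an initial answer for the task.
    code = llm(task)
    for _ in range(rounds):
        # Step 2: ask the model to criticize its own previous output.
        critique = llm(
            "Review the following code and find security problems with it:\n"
            + code
        )
        # Step 3: ask the model to improve the code based on that critique.
        code = llm(
            "Based on the critique:\n" + critique
            + "\nimprove the following code:\n" + code
        )
    return code
```

With `rounds=2`, the loop issues five model calls in total: one generation, then a critique and an improvement per round. In a benchmark setting, each returned version could be written to disk and scanned with a static analyzer to count the remaining findings.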
