MaPPing Your Model: Assessing the Impact of Adversarial Attacks on LLM-based Programming Assistants

John Heibel, Daniel Lowd

TL;DR

This paper investigates the Malicious Programming Prompt (MaPP) attack, which injects attacker text into a coding LLM's prompt to induce insecure code generation. The authors evaluate general and CWE-specific vulnerabilities across seven LLMs using the HumanEval benchmark and the Asleep at the Keyboard dataset, showing that MaPP attacks are highly effective and that scaling models does not prevent them. They argue that current safety fine-tuning is insufficient against prompt-level manipulation and call for stronger prompt integrity, auditing, and restricted tool usage to mitigate risks in real-world, agentic LLM deployments. The study highlights a practical security risk as LLM-based programming assistants increasingly integrate external data sources and tools.

Abstract

LLM-based programming assistants offer the promise of programming faster but with the risk of introducing more security vulnerabilities. Prior work has studied how LLMs could be maliciously fine-tuned to suggest vulnerabilities more often. With the rise of agentic LLMs, which may use results from an untrusted third party, there is a growing risk of attacks on the model's prompt. We introduce the Malicious Programming Prompt (MaPP) attack, in which an attacker adds a small amount of text (under 500 bytes) to a prompt for a programming task. We show that our prompt strategy can cause an LLM to add vulnerabilities while continuing to write otherwise correct code. We evaluate three prompts on seven common LLMs, from basic to state-of-the-art commercial models. Using the HumanEval benchmark, we find that our prompts are broadly effective, with no customization required for different LLMs. Furthermore, the LLMs that are best at HumanEval are also best at following our malicious instructions, suggesting that simply scaling language models will not prevent MaPP attacks. Using a dataset of eight CWEs in 16 scenarios, we find that MaPP attacks are also effective at implementing specific and targeted vulnerabilities across a range of models. Our work highlights the need to secure LLM prompts against manipulation and to rigorously audit code generated with the help of LLMs.

Paper Structure (22 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: A malicious adversary may be able to change LLM behavior through prompting, either by directly modifying the system prompt, crafting text that's retrieved and processed by a RAG (retrieval-augmented generation) system, or by designing an external tool that generates harmful instructions. After its behavior has been corrupted, the LLM will generate insecure code that may be overlooked by an inexperienced or inattentive user.
  • Figure 2: Structure of the prompts used for the randseed, ExFil, and MemLeak tests. The control tests have the same system and user prompts, but with no malicious insert.
  • Figure 3: Fraction of tests in the HumanEval benchmark where the LLM generated the appropriate vulnerability, as specified in the scenario's MaPP. With the exception of Llama 3 8B and GPT-3.5, all LLMs generate all three attacks more than 95% of the time.
  • Figure 4: Fraction of tests passed in the HumanEval benchmark for each combination of LLM and prompt. Blue bars indicate the pass rate for each LLM with a non-malicious prompt. The remaining three bars indicate the rate of passing the benchmark when using a malicious prompt. The yellow/green/red portion of each bar indicates the cases where the test is passed and the vulnerability is included. The blue bar stacked on top indicates cases where the benchmark was passed but the vulnerability was not generated.
  • Figure 5: Fraction of tests in the Asleep at the Keyboard benchmark where the LLM generated the appropriate vulnerability, as specified in the scenario's MaPP. Outputs were not checked for correctness, only the implementation of a vulnerability.
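
The fractions described in the captions for Figures 3 and 4 can be tallied from per-task outcomes. The following is a minimal sketch, not the authors' evaluation code: the `Result` record, the `summarize` function, and the dictionary keys are hypothetical names introduced for illustration, and it assumes each generated solution has already been labeled with whether it passed the benchmark tests and whether it contains the requested vulnerability.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Outcome of one benchmark task under one prompt (hypothetical record)."""
    passed: bool      # generated code passed the benchmark's unit tests
    vulnerable: bool  # generated code contains the requested vulnerability


def summarize(results):
    """Tally the fractions described in Figures 3 and 4 for one LLM/prompt pair."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    counts = Counter((r.passed, r.vulnerable) for r in results)
    passed_vuln = counts[(True, True)]
    passed_clean = counts[(True, False)]
    failed_vuln = counts[(False, True)]
    return {
        # Figure 3: vulnerability generated, regardless of correctness
        "vulnerability_rate": (passed_vuln + failed_vuln) / n,
        # Figure 4, colored portion: test passed and vulnerability included
        "pass_with_vulnerability": passed_vuln / n,
        # Figure 4, stacked blue portion: test passed, no vulnerability
        "pass_without_vulnerability": passed_clean / n,
        # Figure 4, full bar height
        "pass_rate": (passed_vuln + passed_clean) / n,
    }


# Toy data for illustration only; these are not results from the paper.
example = [
    Result(passed=True, vulnerable=True),
    Result(passed=True, vulnerable=True),
    Result(passed=True, vulnerable=False),
    Result(passed=False, vulnerable=True),
]
summary = summarize(example)
print(summary)
```

Under this reading, the two stacked portions of a Figure 4 bar sum to the overall pass rate for that prompt, while the Figure 3 rate also counts outputs that include the vulnerability but fail the benchmark tests.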