SecAlign: Defending Against Prompt Injection with Preference Optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, Chuan Guo

TL;DR

SecAlign addresses prompt-injection attacks on LLM-integrated systems by formulating the defense as preference optimization. It constructs a preference dataset that pairs each prompt-injected input with a secure and an insecure output, then applies Direct Preference Optimization (DPO) so that the model learns to prefer the secure output without sacrificing utility. Across optimization-free and optimization-based attacks, SecAlign achieves state-of-the-art security (often <10% attack success rate, ASR) while maintaining performance on standard benchmarks, and it generalizes to unseen attacks and domains. The work links LLM security to alignment, shows that the defense is practical to deploy with LoRA-based fine-tuning, and opens avenues for multi-modal and real-world defense integration. An open-source implementation is available.

Abstract

Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to <10%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, SecAlign models are still practical with similar utility to the one before defensive training in our evaluations. Our code is at https://github.com/facebookresearch/SecAlign
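The dataset construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the field names (`instruction`, `data`, `output`), the helper names, and the choice to append the injected instruction to the end of the data are all assumptions made for the example. The released code at the URL above is the authoritative reference.

```python
from dataclasses import dataclass
import random


@dataclass
class PreferenceSample:
    prompt: str    # prompt-injected input shown to the model
    chosen: str    # secure output: responds to the legitimate instruction
    rejected: str  # insecure output: responds to the injected instruction


def format_prompt(instruction: str, data: str) -> str:
    # Hypothetical template; real delimiters depend on the model's chat format.
    return f"### instruction:\n{instruction}\n\n### data:\n{data}\n\n### response:\n"


def build_preference_sample(sample: dict, injected: dict) -> PreferenceSample:
    """Pair one instruction-tuning sample with an injection taken from another."""
    # Assumed injection style: append the other instruction to the data part.
    poisoned_data = f"{sample['data']} {injected['instruction']}".strip()
    return PreferenceSample(
        prompt=format_prompt(sample["instruction"], poisoned_data),
        chosen=sample["output"],
        rejected=injected["output"],
    )


def build_preference_dataset(samples: list[dict], seed: int = 0) -> list[PreferenceSample]:
    """Build one preference triple per sample (requires at least two samples)."""
    rng = random.Random(seed)
    dataset = []
    for i, sample in enumerate(samples):
        # Draw a different sample whose instruction plays the role of the injection.
        j = rng.choice([k for k in range(len(samples)) if k != i])
        dataset.append(build_preference_sample(sample, samples[j]))
    return dataset
```

Each resulting triple (`prompt`, `chosen`, `rejected`) has the shape that preference-optimization trainers typically expect, so the dataset needs no human labeling beyond the existing instruction-tuning outputs.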

Paper Structure

This paper contains 48 sections, 6 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Top: We formulate defense against prompt injection as a preference optimization problem. Given a prompt-injected input with the injected instruction highlighted in red, the LLM is fine-tuned to prefer the response to the legitimate instruction over the response to the injection. Bottom: Our proposed SecAlign reduces the attack success rate of the strongest tested prompt injection to 8% without hurting the utility of Llama3-8B-Instruct (Dubey et al., 2024), an advanced LLM. In comparison, the state-of-the-art (SOTA) prompting-based defense In-Context (Wei et al., 2023; see the baselines table, tab:baselines) and the fine-tuning-based defense StruQ (Chen et al., 2024) achieve very limited security and lose utility.
  • Figure 2: The log probability of desirable vs. undesirable outputs. SecAlign achieves a much larger margin between them, indicating stronger robustness to prompt injections. Results are from Llama-7B experiments.
  • Figure 3: The utility (WinRate) and security (ASR) of SecAlign compared to StruQ on Instruct models. SecAlign LLMs retain the high utility of the undefended LLMs and significantly surpass StruQ LLMs in security, especially under strong optimization-based attacks. See the numbers in the extended main-results table (tab:mainextend).
  • Figure 4: The utility (WinRate) and security (ASR) of SecAlign compared to StruQ on base models. See the numbers in the extended main-results table (tab:mainextend).
  • Figure 5: GCG loss of all tested samples on Llama3-8B-Instruct. The center solid line shows average loss and the shaded region shows standard deviation across samples. SecAlign LLM is much harder to attack: in the end, the attack loss is still higher than that at the start of StruQ.
  • ...and 1 more figure
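
The log-probability margin in Figure 2 is closely related to the quantity that Direct Preference Optimization pushes apart. As a rough sketch, the standard per-sample DPO loss and the margin can be written as below. The function names and the default `beta=0.1` are assumptions for illustration, and the inputs are assumed to be summed token log probabilities of a whole response under the trained (policy) and frozen reference models. The paper's actual training setup may differ.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) for both signs of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair."""
    # Implicit rewards: how much more likely each output is than under the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))


def log_prob_margin(chosen_logp: float, rejected_logp: float) -> float:
    """Margin between the secure and insecure output (the gap plotted in Figure 2)."""
    return chosen_logp - rejected_logp
```

Before any training, the policy equals the reference, so the loss starts at log 2 for every pair; it decreases as the secure output gains probability relative to the insecure one.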