Table of Contents
Fetching ...

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo

TL;DR

This work tackles prompt-injection threats in LLM-integrated systems by delivering Meta SecAlign, the first fully open-source LLM with built-in model-level defense that approaches commercial-grade performance. The authors introduce SecAlign++—a training recipe that enforces a secure prompt-data separation via a new input role, randomized injection positioning, and self-generated labeling—to maintain utility while delivering strong PI resistance. Extensive evaluations across nine utility and seven security benchmarks, including agentic workflows, show near-state-of-the-art defense with minimal utility loss, and robust generalization to unseen tasks. The results demonstrate the practicality of open-source secure foundation LLMs for secure agentic applications and invite community-driven attacks/defenses research.

Abstract

Prompt injection attack has been listed as the top-1 security threat to LLM-integrated applications, which interact with external environment data for complex tasks. The untrusted data may contain an injected prompt trying to arbitrarily manipulate the system. Model-level prompt injection defenses have shown strong effectiveness, but are currently deployed into commercial-grade models in a closed-source manner. We believe open-source secure models are needed by the AI security community, where co-development of attacks and defenses through open research drives scientific progress in mitigating prompt injection attacks. To this end, we develop Meta SecAlign, the first fully open-source LLM with built-in model-level defense that achieves commercial-grade performance, powerful enough for complex agentic tasks. We provide complete details of our training recipe, an improved version of the SOTA SecAlign defense. We perform the most comprehensive evaluation to date on 9 utility benchmarks and 7 security benchmarks on general knowledge, instruction following, and agentic workflows. Results show that Meta SecAlign, despite being trained on generic instruction-tuning samples only, surprisingly confers security in unseen downstream tasks, including tool-calling and web-navigation, in addition to general instruction-following. Our best model -- Meta-SecAlign-70B -- establishes a new frontier of utility-security trade-off for open-source LLMs. Even compared to closed-course commercial models such as GPT-5, our model is much securer than most of them. Below are links for the code (https://github.com/facebookresearch/Meta_SecAlign), Meta-SecAlign-70B(https://huggingface.co/facebook/Meta-SecAlign-70B), and Meta-SecAlign-8B(https://huggingface.co/facebook/Meta-SecAlign-8B) models.

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

TL;DR

This work tackles prompt-injection threats in LLM-integrated systems by delivering Meta SecAlign, the first fully open-source LLM with built-in model-level defense that approaches commercial-grade performance. The authors introduce SecAlign++—a training recipe that enforces a secure prompt-data separation via a new input role, randomized injection positioning, and self-generated labeling—to maintain utility while delivering strong PI resistance. Extensive evaluations across nine utility and seven security benchmarks, including agentic workflows, show near-state-of-the-art defense with minimal utility loss, and robust generalization to unseen tasks. The results demonstrate the practicality of open-source secure foundation LLMs for secure agentic applications and invite community-driven attacks/defenses research.

Abstract

Prompt injection attack has been listed as the top-1 security threat to LLM-integrated applications, which interact with external environment data for complex tasks. The untrusted data may contain an injected prompt trying to arbitrarily manipulate the system. Model-level prompt injection defenses have shown strong effectiveness, but are currently deployed into commercial-grade models in a closed-source manner. We believe open-source secure models are needed by the AI security community, where co-development of attacks and defenses through open research drives scientific progress in mitigating prompt injection attacks. To this end, we develop Meta SecAlign, the first fully open-source LLM with built-in model-level defense that achieves commercial-grade performance, powerful enough for complex agentic tasks. We provide complete details of our training recipe, an improved version of the SOTA SecAlign defense. We perform the most comprehensive evaluation to date on 9 utility benchmarks and 7 security benchmarks on general knowledge, instruction following, and agentic workflows. Results show that Meta SecAlign, despite being trained on generic instruction-tuning samples only, surprisingly confers security in unseen downstream tasks, including tool-calling and web-navigation, in addition to general instruction-following. Our best model -- Meta-SecAlign-70B -- establishes a new frontier of utility-security trade-off for open-source LLMs. Even compared to closed-course commercial models such as GPT-5, our model is much securer than most of them. Below are links for the code (https://github.com/facebookresearch/Meta_SecAlign), Meta-SecAlign-70B(https://huggingface.co/facebook/Meta-SecAlign-70B), and Meta-SecAlign-8B(https://huggingface.co/facebook/Meta-SecAlign-8B) models.

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Utility ($\uparrow$, y-axis) and security (attack success rate $\downarrow$, x-axis) of state-of-the-art (SOTA) open-source or closed-source LLMs with prompt injection security. Meta-SecAlign-70B achieves near-zero attack success rates on prompt injection in instruction following (much securer than all others), and agentic tool-calling and web-navigation (comparable to the recent GPT-5 with the high reasoning in both utility and security). Meta-SecAlign-70B is the first open-source prompt-injection-robust LLM that is strong enough for complex agentic workflows, where prompt injection threat mostly lies. SecAlign-70B uses the open-source SOTA defense SecAlign chen2024aligning to fine-tune Llama-3.3-70B-Instruct, the initialization LLM for our Meta-SecAlign-70B. GPT-5/GPT-4o (from OpenAI) and Gemini-Flash-2.5 (from Google) are SOTA commercial LLMs with a claimed implementation of prompt injection defense openai2025gpt5shi2025lessons. For WASP evtimov2025wasp, we report its End2End attack success rates, on which Meta-SecAlign-70B and GPT-5 both emerge as an ideal model with 0% attack success rates and around 60% utility, so the dots are overlapped.
  • Figure 2: The utility-security trade-off when tuning LoRA $\alpha$. Utility is an average across 9 utility benchmarks. ASR is an average across 7 security benchmarks. Both the utility and ASR averages are weighted by the number of samples in each benchmark.
  • Figure 3: Tuning the LoRA $\alpha$ at test time is effective to control Meta-SecAlign-70B security (above) and utility (below). Numbers are at in \ref{['tab:loraalpha']}.
  • Figure 4: Security (attack success rate $\downarrow$) of LLMs with different instruction-following capabilities (Llama-3.1-8B-Instruct < Llama-3.1-70B-Instruct < Llama-3.3-70B-Instruct). Stronger LLMs are more vulnerable to PI attacks when undefended (top), but could be fine-tuned to a similar level of robustness (bottom). Detailed numbers are present in \ref{['tab:scaling_model_size']}.
  • Figure 5: Tuning learning rate at training time can also control the utility(left)-security(right) trade-off for SecAlign++ on Llama-3.3-70B-Instruct.