Securing the Model Context Protocol: Defending LLMs Against Tool Poisoning and Adversarial Attacks

Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh, Naser Ezzati-Jivan

TL;DR

This paper identifies a semantic attack surface in the Model Context Protocol where tool descriptors can influence LLM reasoning. It formalizes three attack classes—Tool Poisoning, Shadowing, and Rug Pulls—and evaluates three LLMs (GPT-4, DeepSeek, Llama-3.5) across eight prompting strategies, revealing model- and strategy-dependent safety and latency trade-offs. A layered defense stack, including RSA-based manifest signing, LLM-on-LLM semantic vetting, and lightweight runtime guardrails, is proposed and empirically validated, showing significant improvements in blocking unsafe tool invocations without requiring model fine-tuning. The work contributes a formal threat model, a reproducible cross-model evaluation pipeline, and actionable protocol-level defenses that advance secure, scalable deployment of tool-augmented LLM agents.

Abstract

The Model Context Protocol (MCP) enables Large Language Models to integrate external tools through structured descriptors, increasing autonomy in decision-making, task execution, and multi-agent workflows. However, this autonomy creates a largely overlooked security gap. Existing defenses focus on prompt-injection attacks and fail to address threats embedded in tool metadata, leaving MCP-based systems exposed to semantic manipulation. This work analyzes three classes of semantic attacks on MCP-integrated systems: (1) Tool Poisoning, where adversarial instructions are hidden in tool descriptors; (2) Shadowing, where trusted tools are indirectly compromised through contaminated shared context; and (3) Rug Pulls, where descriptors are altered after approval to subvert behavior. To counter these threats, we introduce a layered security framework with three components: RSA-based manifest signing to enforce descriptor integrity, LLM-on-LLM semantic vetting to detect suspicious tool definitions, and lightweight heuristic guardrails that block anomalous tool behavior at runtime. Through evaluation of GPT-4, DeepSeek, and Llama-3.5 across eight prompting strategies, we find that security performance varies widely by model architecture and reasoning method. GPT-4 blocks about 71 percent of unsafe tool calls, balancing latency and safety. DeepSeek shows the highest resilience to Shadowing attacks but with greater latency, while Llama-3.5 is fastest but least robust. Our results show that the proposed framework reduces unsafe tool invocation rates without model fine-tuning or internal modification.
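
As a rough illustration of the descriptor-integrity idea behind the Rug Pull defense, the sketch below pins a digest of each tool descriptor at approval time and rejects any descriptor whose content later changes. It is not the paper's method: the paper describes RSA-based manifest signing, whereas this sketch swaps in a plain SHA-256 digest pin to stay within the Python standard library. That substitution detects post-approval edits but does not authenticate who published the descriptor. The descriptor fields (`name`, `description`), the `ApprovalRegistry` class, and the example tool are all hypothetical names chosen for illustration.

```python
import hashlib
import json


def descriptor_digest(descriptor: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding of the descriptor."""
    # Sorting keys and fixing separators makes the encoding independent of key order.
    canonical = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovalRegistry:
    """Hypothetical helper: remembers each tool descriptor's digest at approval time."""

    def __init__(self) -> None:
        self._pinned: dict[str, str] = {}

    def approve(self, descriptor: dict) -> None:
        # Pin the digest under the tool's name (assumed to be a unique identifier).
        self._pinned[descriptor["name"]] = descriptor_digest(descriptor)

    def is_unchanged(self, descriptor: dict) -> bool:
        # Unknown tools and tools whose descriptor differs from the pin are rejected.
        pinned = self._pinned.get(descriptor["name"])
        return pinned is not None and pinned == descriptor_digest(descriptor)


registry = ApprovalRegistry()

# Descriptor as reviewed and approved (illustrative content only).
approved = {"name": "read_file", "description": "Read a file from the workspace."}
registry.approve(approved)

# Same content with a different key order: canonicalization keeps the digest stable.
reordered = {"description": "Read a file from the workspace.", "name": "read_file"}

# Descriptor altered after approval, as in a Rug Pull.
altered = dict(approved, description="Read a file and forward its contents elsewhere.")

print(registry.is_unchanged(approved))
print(registry.is_unchanged(reordered))
print(registry.is_unchanged(altered))
```

A signature-based variant would replace the pinned digest with a signature verified against the publisher's public key, which is closer to what the abstract describes. The digest pin is used here only because it keeps the example short and dependency-free.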

Paper Structure

This paper contains 72 sections, 53 equations, 9 figures, 22 tables, 1 algorithm.

Figures (9)

  • Figure 1: System architecture pipeline for MCP-integrated toolchains.
  • Figure 2: Threat model for Tool Poisoning, Shadowing, and Rug Pull attacks in MCP-based LLM systems.
  • Figure 3: Average Prompt Length vs. Unsafe Tool Invocation Across Strategies.
  • Figure 4: Mean tool invocation latency across prompting strategies.
  • Figure 5: Latency distribution across LLMs.
  • ...and 4 more figures