
We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems

Junfeng Fang, Zijun Yao, Ruipeng Wang, Haokai Ma, Xiang Wang, Tat-Seng Chua

TL;DR

This position paper focuses on the safety risks introduced by Model Context Protocol (MCP) in LLM-powered agent systems, highlighting the threat from unvetted third-party MCP services. It introduces SafeMCP, a diagnostic framework with a controllable environment, attack-defense mechanisms, and standardized metrics to evaluate MCP safety. Through pilot experiments, it demonstrates real, non-trivial safety risks and shows that both passive detection and basic defenses are insufficient, while active defense helps but has limitations. The authors propose a six-pronged roadmap—spanning red teaming, safe backbone LLMs, evaluation, data accumulation, service safeguards, and ecosystem governance—to build safer MCP ecosystems and encourage cross-sector collaboration.

Abstract

The development of large language models (LLMs) has entered an experience-driven era, marked by the emergence of environment-feedback-driven learning via reinforcement learning and by tool-using agents. This trend has encouraged the emergence of the Model Context Protocol (MCP), which defines a standard for how an LLM interacts with external services, such as APIs and data. However, as MCP becomes the de facto standard for LLM agent systems, it also introduces new safety risks. In particular, MCP brings third-party services, which are not controlled by the LLM developers, into agent systems. These third-party MCP service providers are potentially malicious and have economic incentives to exploit vulnerabilities and sabotage user-agent interactions. In this position paper, we urge the LLM safety research community to pay close attention to the new safety risks introduced by MCP and to develop new techniques for building safe MCP-powered agent systems. We establish our position in three parts. (1) We first construct SafeMCP, a controlled framework for examining safety issues in MCP-powered agent systems. (2) We then conduct a series of pilot experiments demonstrating that the safety risks in MCP-powered agent systems are a real threat and that defending against them is non-trivial. (3) Finally, we give our outlook with a roadmap for building safe MCP-powered agent systems. In particular, we call on researchers to pursue the following research directions: red teaming, MCP-safe LLM development, MCP safety evaluation, MCP safety data accumulation, MCP service safeguards, and MCP-safe ecosystem construction. We hope this position paper raises the research community's awareness of MCP safety and encourages more researchers to join this important research direction. Our code is available at https://github.com/littlelittlenine/SafeMCP.git.

Paper Structure

This paper contains 24 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The overall framework of an MCP-safe agent system, including: (1) Upper left: the differences between MCP-introduced safety risks and traditional LLM safety risks. (2) Right: the overall architecture of SafeMCP. (3) Bottom left: the outlook for MCP safety.
  • Figure 2: We show the safety performance under different defense strategies. (a) The detection ratio (%) of different LLMs and safety detection models. (b) The attack success rate (ASR) and harm rate (HR) of different attacks on multiple models, before and after applying each defense strategy. For brevity, the prefixes "G-", "Q-", and "D-" denote the backbone LLMs GPT-4o-mini, Qwen3-14B, and Doubao, respectively. We use ren, cod, dee, and cip to denote ReNeLLM, CodeChameleon, DeepInception, and CipherChat, respectively.
  • Figure 3: We show the relative accuracy loss (RAL) incurred by the defense, comparing accuracy before and after it is applied.
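The figure captions name four metrics (detection ratio, ASR, HR, and RAL) but this excerpt does not define them. The Python sketch below shows one common way such rates could be computed. It is an assumption, not SafeMCP's actual implementation: the `Trial` record, its per-attempt labels, and all helper names are hypothetical, and RAL is assumed to be the accuracy drop divided by the pre-defense accuracy.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    """One attack attempt against an agent (hypothetical record format)."""
    attack_succeeded: bool  # the agent carried out the attacker's goal
    harmful_output: bool    # a judge labelled the final response harmful
    flagged: bool           # a detector flagged the malicious MCP content


def rate(flags: Sequence[bool]) -> float:
    """Percentage of True values; 0.0 for an empty sequence."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def attack_success_rate(trials: Sequence[Trial]) -> float:
    """ASR (%): share of attempts where the attack succeeded (assumed definition)."""
    return rate([t.attack_succeeded for t in trials])


def harm_rate(trials: Sequence[Trial]) -> float:
    """HR (%): share of attempts that produced a harmful response (assumed definition)."""
    return rate([t.harmful_output for t in trials])


def detection_ratio(trials: Sequence[Trial]) -> float:
    """Detection ratio (%): share of malicious inputs the detector flagged."""
    return rate([t.flagged for t in trials])


def relative_accuracy_loss(acc_before: float, acc_after: float) -> float:
    """RAL (%): benign-task accuracy lost after a defense, relative to before (assumed)."""
    if acc_before <= 0:
        raise ValueError("acc_before must be positive")
    return 100.0 * (acc_before - acc_after) / acc_before


if __name__ == "__main__":
    # Toy, made-up attempts purely to exercise the functions.
    trials = [
        Trial(attack_succeeded=True, harmful_output=True, flagged=False),
        Trial(attack_succeeded=True, harmful_output=False, flagged=True),
        Trial(attack_succeeded=False, harmful_output=False, flagged=True),
        Trial(attack_succeeded=False, harmful_output=False, flagged=False),
    ]
    print(attack_success_rate(trials))  # 50.0
    print(harm_rate(trials))            # 25.0
    print(detection_ratio(trials))      # 50.0
    print(round(relative_accuracy_loss(0.80, 0.72), 2))  # 10.0
```

Reporting these numbers side by side is useful because a defense that lowers ASR and HR while sharply raising RAL is trading safety for utility on benign tasks.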