Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions

Mohammed Mehedi Hasan, Hao Li, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan

TL;DR

This paper investigates Model Context Protocol (MCP) tool descriptions as a critical interface for FM-based agents and reveals pervasive description smells across real-world MCP servers. It introduces a six-component rubric, an FM-based smell scanner, and a semi-automated augmentation pipeline, then evaluates their impact on performance using the MCP-Universe benchmark. The results show that augmenting tool descriptions improves task success and evaluator scores but increases execution steps and costs, with domain- and model-specific variations; a targeted, compact component set can often match full augmentation with lower overhead. The work highlights the need for cost-aware, component-driven design of tool descriptions and proposes practical tooling (description router, per-component schemas) to optimize MCP workflows in real-world deployments and research.

Abstract

The Model Context Protocol (MCP) standardizes how Foundation Model (FM)-based agents interact with external systems by invoking tools. However, to understand a tool's purpose and features, FMs rely on natural-language tool descriptions, making these descriptions a critical component in guiding FMs to select the optimal tool for a given (sub)task and to pass the right arguments to the tool. While defects or smells in these descriptions can misguide FM-based agents, their prevalence and consequences in the MCP ecosystem remain unclear. To address this, we conduct the first large-scale empirical study of 856 tools spread across 103 MCP servers, assessing their description quality and their impact on agent performance. We identify six components of tool descriptions from the literature, develop a scoring rubric utilizing these components, then formalize tool description smells based on this rubric. By operationalizing this rubric through an FM-based scanner, we find that 97.1% of the analyzed tool descriptions contain at least one smell, with 56% failing to state their purpose clearly. While augmenting these descriptions for all components improves task success rates by a median of 5.85 percentage points and improves partial goal completion by 15.12%, it also increases the number of execution steps by 67.46% and regresses performance in 16.67% of cases. These findings highlight a trade-off between agent performance and cost, as well as the context sensitivity of the performance gain. Furthermore, component ablations show that compact variants of different component combinations often preserve behavioral reliability while reducing unnecessary token overhead, enabling more efficient use of the FM context window and lower execution costs.
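The rubric-to-smell step described in the abstract can be illustrated with a minimal sketch. It is not the authors' scanner: the FM-based scoring is replaced by hand-supplied scores, only the "purpose" component name comes from the text, the other five names are hypothetical placeholders, and the cutoff of 3 on the 5-point scale is an assumption made for illustration.

```python
# Only "purpose" is named in the text; the remaining five component names
# are hypothetical placeholders, not the paper's rubric.
COMPONENTS = [
    "purpose",
    "component_2",
    "component_3",
    "component_4",
    "component_5",
    "component_6",
]

# Assumption: a component scoring below 3 on the 5-point scale is a smell.
SMELL_THRESHOLD = 3


def find_smells(scores: dict[str, int], threshold: int = SMELL_THRESHOLD) -> list[str]:
    """Return the components of one tool description that count as smells."""
    missing = [c for c in COMPONENTS if c not in scores]
    if missing:
        raise ValueError(f"missing component scores: {missing}")
    for component in COMPONENTS:
        if not 1 <= scores[component] <= 5:
            raise ValueError(f"{component} score must be between 1 and 5")
    return [c for c in COMPONENTS if scores[c] < threshold]


def smell_prevalence(all_scores: list[dict[str, int]]) -> float:
    """Percentage of tool descriptions that contain at least one smell."""
    if not all_scores:
        return 0.0
    smelly = sum(1 for scores in all_scores if find_smells(scores))
    return 100.0 * smelly / len(all_scores)


if __name__ == "__main__":
    clean = {c: 4 for c in COMPONENTS}
    unclear_purpose = {**clean, "purpose": 2}
    print(find_smells(unclear_purpose))
    print(smell_prevalence([clean, unclear_purpose]))
```

Scoring each component independently, as the rubric does, lets a scan report which component is deficient rather than a single aggregate quality number.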

Paper Structure

This paper contains 48 sections, 1 equation, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: MCP workflow for an FM-based agent. When an agent receives a user query, (1) it retrieves tool metadata (name, description, and input schema) via the MCP client; (2) the agent prompts the foundation model (FM) with the user query and retrieved metadata, whereupon the FM plans the solution, formulates the appropriate tool call, and instructs the agent to execute it; (3) the agent executes the tool call via the MCP client; and (4) the agent forwards the tool response to the FM, which synthesizes the final answer for the user.
  • Figure 2: Comparison of two Yahoo Finance MCP tool descriptions used by the same FM-based agent. The original version (a) provides ambiguous guidance, while the forked version (b) clarifies parameter names and formats. This difference in description quality directly influences how the FM selects parameters during tool invocation, affecting data retrieval scope, latency, and overall efficiency.
  • Figure 3: Overview of the study design. The components in bright yellow boxes represent workflows repurposed from the MCP-Universe benchmark for evaluation and benchmarking, while all other components and processes were introduced in this study.
  • Figure 4: Tool Description for the Sequential Thinking tool.
  • Figure 5: Scoring instrumentation for the Purpose component. To ensure granular evaluation, this 5-point Likert scoring is applied independently to each of the six components.
  • ...and 7 more figures
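The four-step workflow in the Figure 1 caption can be sketched as a toy agent loop. This is a self-contained illustration under stated assumptions, not the MCP SDK: `FakeMCPClient`, `fake_fm_plan`, `fake_fm_answer`, and the `word_count` tool are all hypothetical stand-ins, and the "FM" is a keyword matcher. The point it shows is that tool selection depends only on the metadata the model sees, which is why description quality matters.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict[str, Any]
    handler: Callable[..., Any]


class FakeMCPClient:
    """Hypothetical in-memory stand-in for an MCP client."""

    def __init__(self, tools: list[Tool]) -> None:
        self._tools = {tool.name: tool for tool in tools}

    def list_tools(self) -> list[dict[str, Any]]:
        # Expose only metadata: name, description, and input schema.
        return [
            {"name": t.name, "description": t.description, "input_schema": t.input_schema}
            for t in self._tools.values()
        ]

    def call_tool(self, name: str, arguments: dict[str, Any]) -> Any:
        return self._tools[name].handler(**arguments)


def fake_fm_plan(query: str, metadata: list[dict[str, Any]]) -> dict[str, Any]:
    """Toy 'FM': pick the first tool whose description shares a word with the query."""
    words = set(query.lower().split())
    for tool in metadata:
        if words & set(tool["description"].lower().split()):
            return {"tool": tool["name"], "arguments": {"text": query}}
    raise LookupError("no tool description matches the query")


def fake_fm_answer(query: str, tool_response: Any) -> str:
    """Toy 'FM': turn the tool response into a final answer."""
    return f"Result: {tool_response}"


def run_agent(query: str, client: FakeMCPClient) -> str:
    metadata = client.list_tools()                                # (1) retrieve tool metadata
    call = fake_fm_plan(query, metadata)                          # (2) FM plans the tool call
    response = client.call_tool(call["tool"], call["arguments"])  # (3) agent executes the call
    return fake_fm_answer(query, response)                        # (4) FM synthesizes the answer


if __name__ == "__main__":
    word_count = Tool(
        name="word_count",
        description="Count the words in a text",
        input_schema={"type": "object", "properties": {"text": {"type": "string"}}},
        handler=lambda text: len(text.split()),
    )
    print(run_agent("count words here", FakeMCPClient([word_count])))
```

In this toy, replacing the description with one that shares no words with the query makes `fake_fm_plan` fail to find the tool, a crude analogue of a description that does not state its purpose clearly.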