TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks

Hanwen Xu, Xuyao Huang, Yuzhe Liu, Kai Yu, Zhijie Deng

TL;DR

TPS-Bench tackles the challenge of evaluating AI agents on tool planning and scheduling for real-world, compounding tasks. It constructs a heterogeneous MCP-based benchmark with two difficulty levels and uses LLM-as-a-judge to measure task completion and efficiency, including token usage and execution time. The paper presents comprehensive experiments across multiple LLMs, revealing that while planning is reasonable, scheduling strategies markedly affect efficiency, and that trade-offs exist between sequential and parallel tool use. A preliminary reinforcement learning study with GRPO demonstrates meaningful improvements in both speed and accuracy, suggesting a viable path to more efficient tool-augmented agents. The work also provides open-source resources to enable broader replication and advancement of tool planning and scheduling capabilities in LLMs.

Abstract

Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of Model Context Protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, and calendar checking, and each subtask can be completed by a basic tool. Our evaluation emphasizes both task completion rate and efficiency. The empirical studies on popular closed-source and open-source LLMs indicate that most models can perform reasonable tool planning, but differ in scheduling. For example, GLM-4.5 achieves a superior task completion rate of 64.72% through extensive sequential tool calls, and therefore suffers from a markedly long execution time. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Considering that reinforcement learning (RL) can be a viable way to improve scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in task completion rate with merely 100 RL training samples. Our code is available at https://github.com/hanwenxu1/mcp-agent.
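
To make the sequential-versus-parallel trade-off described in the abstract concrete, the following is a minimal sketch of dependency-aware scheduling: subtasks whose dependencies are already satisfied are grouped into one turn and could be issued as parallel tool calls. It is not the benchmark's or the authors' implementation. The subtask names, the dependency graph, and the per-tool latencies are hypothetical values chosen for illustration only.

```python
def schedule_turns(dependencies):
    """Group subtasks into turns.

    `dependencies` maps each subtask name to the subtasks it depends on.
    All subtasks in one turn have their dependencies met by earlier turns,
    so they could be dispatched as parallel tool calls.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    done = set()
    turns = []
    while remaining:
        # A subtask is ready once every dependency has already completed.
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unknown dependency among subtasks")
        turns.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return turns


def sequential_time(latencies):
    """Total time if every tool call is issued one after another."""
    return sum(latencies.values())


def parallel_time(turns, latencies):
    """Total time if each turn runs its calls in parallel (slowest call wins)."""
    return sum(max(latencies[task] for task in turn) for turn in turns)


# Hypothetical compounding task (names and latencies are assumptions).
deps = {
    "web_search": [],
    "calendar_check": [],
    "map_navigation": ["web_search"],
    "final_summary": ["map_navigation", "calendar_check"],
}
latencies = {
    "web_search": 2.0,
    "calendar_check": 1.0,
    "map_navigation": 3.0,
    "final_summary": 0.5,
}

turns = schedule_turns(deps)
print(turns)
print(sequential_time(latencies), parallel_time(turns, latencies))
```

In this toy example the two independent subtasks share the first turn, so the parallel schedule finishes sooner than the fully sequential one. The sketch models latency only; it says nothing about how scheduling affects task completion rate, which the paper measures empirically.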

Paper Structure

This paper contains 33 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: TPS-Bench assesses the tool planning and scheduling ability of LLM agents for solving compounding tasks. As shown, the LLM agent needs to first select tools capable of solving the task from a tool repository. Next, the LLM agent decomposes the original task into multiple subtasks and identifies their dependency relationships (subtasks that have no dependencies are marked by different colors in the figure). After that, the LLM agent performs tool calls and collects the tool responses in multiple turns, after which the final answer is delivered.
  • Figure 2: Statistics of subtasks and tools in TPS-Bench. Left: Distribution of all subtask categories in TPS-Bench. Right: Number of tools in each MCP Server used in TPS-Bench.
  • Figure 3: Task construction workflow in TPS-Bench. We send tool names and descriptions to LLMs to generate solvable subtasks, which are then combined by LLMs and inspected manually to form compounding tasks.
  • Figure 4: Left: Pearson correlation between LLM scores and human scores on task completion rate. The fitted line (red) shows the linear regression trend, while the dashed black line indicates great agreement between LLMs and humans. Right: Pearson correlation between LLMs and humans regarding the number of subtasks decomposed from the same task.
  • Figure 5: Evaluation of three tool selection strategies (no-selection, rule-based selection, and self-selection) on the tasks in TPS-Bench-Hard. Four models (GLM-4.5, DeepSeek-R1, GPT-4o, and Qwen3-32B) are each tested with all three strategies. Efficiency is reflected by the total number of tokens consumed and the time required to complete each task, while effectiveness is reflected by the task completion rate.
  • ...and 2 more figures
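
The agreement check described in the Figure 4 caption relies on the Pearson correlation between LLM-assigned and human-assigned scores. A minimal sketch of that computation is given below, assuming paired per-task scores; the score lists are made-up illustrative values, not data from the paper, and the function name `pearson` is our own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-task scores from an LLM judge and from human annotators.
llm_scores = [0.2, 0.5, 0.6, 0.9]
human_scores = [0.25, 0.4, 0.7, 0.85]

print(round(pearson(llm_scores, human_scores), 3))
```

A coefficient close to 1 would indicate that the LLM judge ranks and spaces tasks much as human annotators do, which is the property the figure is used to support.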