CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, Yi R. Fung

TL;DR

CostBench addresses a gap in the evaluation of cost-aware planning for LLM tool-use agents operating in dynamic environments. It presents a scalable, cost-centric benchmark built around the travel-planning domain, featuring atomic and composite tools with randomized costs and a dynamic blocking module that covers tool failures, cost changes, preference changes, and tool removals. The benchmark defines multiple cost-sensitive metrics and compares ten models, revealing substantial weaknesses: even GPT-5 achieves under 75% exact match on the hardest static tasks, and performance drops further under cost-change conditions and dynamic disruptions, highlighting sensitivity to cost noise and difficulty with replanning. By providing a flexible generation pipeline, a dynamic interaction environment, and a comprehensive analysis, CostBench offers a roadmap for developing economically rational and robust LLM agents.

Abstract

Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This emphasis neglects a crucial capability: agents' ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents' economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and require agents to adapt in real time. Evaluating leading open-source and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than a 75% exact match rate on the hardest tasks, and performance dropping further by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
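
To make the notion of a cost-optimal plan concrete, the sketch below treats each tool as a priced transition between task states and finds the cheapest tool sequence with Dijkstra's algorithm, then replans after a cost change and after a tool removal. This is an illustration only, not the benchmark's implementation: the `Tool` class, the state names, the tool names, and all costs are hypothetical, and only two of the four blocking event types are modeled.

```python
import heapq
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool: moves the task from state `src` to `dst` at a price."""
    name: str
    src: str
    dst: str
    cost: float


def cheapest_plan(tools, start, goal):
    """Dijkstra over task states; returns (total_cost, tool_names) or None."""
    frontier = [(0.0, 0, start, [])]  # (cost so far, tie-breaker, state, plan)
    settled = {}
    counter = 1
    while frontier:
        cost, _, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if state in settled and settled[state] <= cost:
            continue  # already reached this state at least as cheaply
        settled[state] = cost
        for tool in tools:
            if tool.src == state:
                heapq.heappush(
                    frontier,
                    (cost + tool.cost, counter, tool.dst, plan + [tool.name]),
                )
                counter += 1
    return None  # goal unreachable with the available tools


# Assumed toy task: two atomic tools and one composite tool reach the same state.
tools = [
    Tool("find_location", "start", "location", 4.0),          # atomic
    Tool("find_transport", "location", "transport", 5.0),     # atomic
    Tool("find_location_and_transport", "start", "transport", 8.0),  # composite
    Tool("book", "transport", "done", 3.0),                   # atomic
]

static_plan = cheapest_plan(tools, "start", "done")

# Blocking event 1 (cost change): the composite tool becomes more expensive.
repriced = [
    replace(t, cost=12.0) if t.name == "find_location_and_transport" else t
    for t in tools
]
plan_after_cost_change = cheapest_plan(repriced, "start", "done")

# Blocking event 2 (tool removal): the composite tool is no longer available.
remaining = [t for t in tools if t.name != "find_location_and_transport"]
plan_after_removal = cheapest_plan(remaining, "start", "done")

print(static_plan)
print(plan_after_cost_change)
print(plan_after_removal)
```

In this toy setting the composite tool is on the cheapest path only while its price stays below the sum of the atomic tools it replaces, so either event forces a switch to the atomic sequence. An agent that keeps following its original plan after such an event would complete the task but at a higher cost than necessary.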

Paper Structure

This paper contains 1 section, 1 table.

Table of Contents

  1. Introduction