Budget-Aware Tool-Use Enables Effective Agent Scaling

Tengxiao Liu, Zifeng Wang, Jin Miao, I-Hung Hsu, Jun Yan, Jiefeng Chen, Rujun Han, Fangyuan Xu, Yanfei Chen, Ke Jiang, Samira Daruki, Yi Liang, William Yang Wang, Tomas Pfister, Chen-Yu Lee

TL;DR

This paper tackles how to scale tool-use for tool-augmented agents under explicit tool-call budgets, showing that naive budget increases fail without budget awareness. It introduces Budget Tracker, a lightweight plug-in that surfaces real-time budget signals, and BATS, a budget-aware framework that dynamically tunes planning and self-verification based on remaining resources, together with a unified cost metric combining token and tool costs. Through experiments on web-search benchmarks across multiple backbones, budget-aware methods push the cost-performance Pareto frontier and achieve better efficiency under budget constraints without additional training. These findings offer a principled, transparent foundation for designing scalable, resource-aware tool-using agents.

Abstract

Scaling test-time computation improves performance across different tasks on large language models (LLMs), which has also been extended to tool-augmented agents. For these agents, scaling involves not only "thinking" in tokens but also "acting" via tool calls. The number of tool calls directly bounds the agent's interaction with the external environment. However, we find that simply granting agents a larger tool-call budget fails to improve performance, as they lack "budget awareness" and quickly hit a performance ceiling. To address this, we study how to scale such agents effectively under explicit tool-call budgets, focusing on web search agents. We first introduce the Budget Tracker, a lightweight plug-in that provides the agent with continuous budget awareness, enabling simple yet effective scaling. We further develop BATS (Budget Aware Test-time Scaling), an advanced framework that leverages this awareness to dynamically adapt its planning and verification strategy, deciding whether to "dig deeper" on a promising lead or "pivot" to new paths based on remaining resources. To analyze cost-performance scaling in a controlled manner, we formalize a unified cost metric that jointly accounts for token and tool consumption. We provide the first systematic study on budget-constrained agents, showing that budget-aware methods produce more favorable scaling curves and push the cost-performance Pareto frontier. Our work offers empirical insights toward a more transparent and principled understanding of scaling in tool-augmented agents.
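The abstract describes a unified cost metric that jointly accounts for token and tool consumption but does not give its form here. The following is a minimal sketch of one way such a metric could be computed, assuming a simple additive model: token cost plus the sum of per-call tool costs. The `Usage` type, the tool names, and every price below are hypothetical values chosen for illustration, not figures from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Usage:
    """Resources consumed by one agent run (hypothetical record type)."""
    input_tokens: int
    output_tokens: int
    tool_calls: dict = field(default_factory=dict)  # tool name -> call count


# Hypothetical prices, for illustration only (not taken from the paper).
TOKEN_PRICE = {"input": 1.0e-6, "output": 4.0e-6}  # cost per token
TOOL_PRICE = {"search": 1.0e-3, "browse": 2.0e-3}  # cost per tool call


def unified_cost(usage: Usage) -> float:
    """Return token cost plus tool cost under the assumed additive model."""
    token_cost = (
        usage.input_tokens * TOKEN_PRICE["input"]
        + usage.output_tokens * TOKEN_PRICE["output"]
    )
    tool_cost = sum(TOOL_PRICE[name] * n for name, n in usage.tool_calls.items())
    return token_cost + tool_cost


if __name__ == "__main__":
    run = Usage(input_tokens=1000, output_tokens=500,
                tool_calls={"search": 3, "browse": 1})
    print(f"unified cost: {unified_cost(run):.4f}")
```

Putting both resources on one scale is what allows runs that trade tokens against tool calls to be placed on a single cost-performance curve.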

Paper Structure

This paper contains 43 sections, 2 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Budget Tracker is a lightweight plug-in that can be applied to both a standard ReAct agent (top) and more advanced orchestration frameworks like BATS (bottom). In this figure, blue boxes highlight modules that adapt to the budget.
  • Figure 2: At each interaction round, the agent is provided with its current and remaining budget through the budget tracker before generating the next thinking step and the tool call actions.
  • Figure 3: ReAct saturates and fails to utilize additional tool budget, reaching a performance ceiling. In contrast, ReAct + Budget Tracker continues to scale effectively with larger budgets, achieving consistent accuracy improvements.
  • Figure 4: Comparison of Budget Tracker and ReAct in sequential scaling using Gemini-2.5-Pro. With explicit budget awareness, Budget Tracker consistently improves upon ReAct at equal budgets. ReAct plateaus early as it cannot utilize extra resources, while Budget Tracker adapts its spending to gain further improvements and extend the cost-performance frontier.
  • Figure 5: Comparison of Budget Tracker and ReAct in parallel scaling using Gemini-2.5-Pro. The left subfigure shows accuracy scaling with increasing parallel runs, while the right subfigure illustrates the corresponding cost–performance trend.
  • ...and 12 more figures
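
The per-round behavior described in the Figure 2 caption (the agent sees its current and remaining budget before each thinking step and tool call) can be sketched as below. This is a hedged illustration, not the authors' implementation: the `BudgetTracker` class, the `<budget>` text format, the `run_agent` loop, and the `policy` callable standing in for the LLM are all assumptions made for this example.

```python
class BudgetTracker:
    """Tracks tool-call usage against fixed per-tool budgets (hypothetical)."""

    def __init__(self, budgets):
        self.budgets = dict(budgets)             # tool name -> allowed calls
        self.used = {name: 0 for name in budgets}

    def remaining(self, tool):
        return self.budgets[tool] - self.used[tool]

    def can_call(self, tool):
        return self.remaining(tool) > 0

    def record(self, tool):
        if not self.can_call(tool):
            raise RuntimeError(f"budget exhausted for tool {tool!r}")
        self.used[tool] += 1

    def status(self):
        """Render the budget state as text to place in the agent's context."""
        lines = [
            f"{name}: used {self.used[name]}, remaining {self.remaining(name)}"
            for name in self.budgets
        ]
        return "<budget>\n" + "\n".join(lines) + "\n</budget>"


def run_agent(policy, tracker, max_rounds=10):
    """Show the budget status each round, then let the policy pick a tool.

    `policy` is a stand-in for the model: it takes the context string and
    returns a tool name known to the tracker, or None to stop.
    """
    context = []
    for _ in range(max_rounds):
        context.append(tracker.status())         # budget signal comes first
        tool = policy("\n".join(context))
        if tool is None or not tracker.can_call(tool):
            break
        tracker.record(tool)
        context.append(f"called {tool}")          # placeholder for a tool result
    return context


if __name__ == "__main__":
    tracker = BudgetTracker({"search": 2})
    for entry in run_agent(lambda ctx: "search", tracker):
        print(entry)
```

In this sketch the tracker only reports and enforces the budget; any decision about how to spend it is left to the policy, which mirrors the plug-in role the captions describe.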