
Budget-Aware Agentic Routing via Boundary-Guided Training

Caiqi Zhang, Menglin Xia, Xuchao Zhang, Daniel Madrigal, Ankur Mallick, Samuel Kessler, Victor Ruehle, Saravan Rajmohan

TL;DR

Overall, this work establishes a foundational framework for agentic routing, shifting the paradigm from static model selection to dynamic, budget-aware sequential decision-making.

Abstract

As large language models (LLMs) evolve into autonomous agents that execute long-horizon workflows, invoking a high-capability model at every step becomes economically unsustainable. While model routing is effective for single-turn queries, agentic routing is a sequential, path-dependent problem: early mistakes compound, feedback often arrives only at the end of the episode, and deployments often demand strict per-task spending limits. We propose Budget-Aware Agentic Routing, which selects between a cheap and an expensive model at each step to optimize the cost-success frontier and to operate under strict per-task budgets. We further introduce Boundary-Guided Training, which leverages two boundary policies (always-small vs. always-large) to build a difficulty taxonomy and to anchor learning under sparse rewards. Our approach warm-starts with boundary-guided SFT data synthesis via stratified sampling of cost-efficient trajectories, then applies Boundary-Guided Policy Optimization (BoPO), which combines boundary-relative rewards with a reference-guided advantage to avoid degenerate cheap-failure solutions. Experimental results show that our method improves the efficiency frontier, matching strong routing baselines at substantially lower cost while generalizing to strict inference-time budget constraints. Overall, our work establishes a foundational framework for agentic routing, shifting the paradigm from static model selection to dynamic, budget-aware sequential decision-making.
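
The difficulty taxonomy built from the two boundary policies can be illustrated with a minimal sketch. The category names Easy, Hard, and Intractable come from the Figure 3 caption; the exact assignment rule is not given on this page, so the rule below (one success flag per boundary policy, and treating any task the small model solves as Easy) is an assumption, and `classify_task` is a hypothetical helper name.

```python
from enum import Enum


class Difficulty(Enum):
    EASY = "easy"                # assumed: the always-small policy already succeeds
    HARD = "hard"                # assumed: only the always-large policy succeeds
    INTRACTABLE = "intractable"  # assumed: both boundary policies fail


def classify_task(small_succeeds: bool, large_succeeds: bool) -> Difficulty:
    """Label a task from the outcomes of the two boundary policies.

    Hypothetical rule for illustration only; the paper may use several
    rollouts per policy or a different assignment.
    """
    if small_succeeds:
        return Difficulty.EASY
    if large_succeeds:
        return Difficulty.HARD
    return Difficulty.INTRACTABLE


# Example: bucket a few tasks by their boundary outcomes.
outcomes = {"task_a": (True, True), "task_b": (False, True), "task_c": (False, False)}
labels = {name: classify_task(s, l) for name, (s, l) in outcomes.items()}
```

Under this assumed rule, only Hard tasks are ones where paying for the large model can change the outcome, which is consistent with the budget-allocation analysis in Figure 3.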

Paper Structure (27 sections, 15 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Pareto efficiency frontiers across three agentic benchmarks. We plot the Success Rate against the Average Cost per Task ($) for SciWorld, ALFWorld, and AppWorld. BoPO (Ours) consistently pushes the efficiency frontier toward the top-left (higher success at lower cost) compared to single-turn, cascading, and vanilla RL baselines, achieving comparable success to "Always Large" policies at a fraction of the cost.
  • Figure 2: Component-wise ablation study on the SciWorld benchmark. The plot illustrates the contribution of key technical innovations to the efficiency frontier. The full BoPO method (red) strictly improves upon other ablations.
  • Figure 3: Analysis of budget allocation versus task difficulty. The leftmost bar shows the average difficulty distribution of the datasets. While Random and RL baselines distribute costs inefficiently (wasting resources on Intractable or Easy tasks), BoPO (rightmost bar) concentrates its spending on Hard tasks (52.2%), where reasoning capabilities yield the highest marginal return.
  • Figure 4: Generalization to open-source model pairs. Pareto efficiency frontier on the SciWorld benchmark using Llama-3.1-8B-Instruct as $\mathcal{M}_{small}$ and Llama-3.1-72B-Instruct as $\mathcal{M}_{large}$. BoPO (red curve) consistently outperforms baselines, demonstrating that our boundary-guided framework is model-agnostic and successfully transfers to architectures with different cost-capability ratios. We use the API provided by together.ai.
  • Figure 5: Cost distribution analysis between static policies. We plot the trajectory cost of $\pi_{large}$ (y-axis) versus $\pi_{small}$ (x-axis) across three benchmarks. Green points ($\mathcal{C}_{large} > \mathcal{C}_{small}$) represent standard cases, while red points indicate cost inversions where the small model incurs higher costs due to long, ineffective failure trajectories. This high variance and frequent inversion necessitate our Normalized Cost formulation, which scales penalties relative to the specific task's difficulty boundaries rather than absolute dollar values.
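
The idea behind the Normalized Cost mentioned in the Figure 5 caption, scaling a trajectory's cost relative to the task's own boundary costs rather than absolute dollars, can be sketched as follows. The paper's exact formula is not reproduced on this page, so the min-max form, the clipping to [0, 1], and the handling of cost inversions below are all assumptions; `normalized_cost` is a hypothetical function name.

```python
def normalized_cost(cost: float, cost_small: float, cost_large: float,
                    eps: float = 1e-8) -> float:
    """Scale a trajectory cost into [0, 1] using per-task boundary costs.

    Assumed form (not the paper's): min-max scaling between the cheaper and
    the costlier boundary trajectory, so that cost inversions (small model
    costing more than the large one) do not produce a negative range.
    """
    lo = min(cost_small, cost_large)
    hi = max(cost_small, cost_large)
    span = hi - lo
    if span < eps:
        # Boundary costs coincide: no meaningful scale, so apply no penalty.
        return 0.0
    scaled = (cost - lo) / span
    # Clip so trajectories outside the boundary range stay bounded.
    return min(max(scaled, 0.0), 1.0)


# Standard case: large-model trajectory costs more than the small one.
mid_point = normalized_cost(0.5, cost_small=0.0, cost_large=1.0)
# Inverted case: the small model's failed trajectory was the costlier one.
inverted = normalized_cost(0.3, cost_small=0.4, cost_large=0.2)
```

Because the scale is recomputed per task, a ten-cent overspend on a cheap task and on an expensive task are penalized differently, which is the property the caption motivates.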