Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

Abstract

Semantic routers in LLM inference gateways select tools in the critical request path, where every millisecond of added latency compounds across millions of requests. We propose Outcome-Aware Tool Selection (OATS), which interpolates tool embeddings toward the centroid of queries where they historically succeed. This is an offline process that adds no parameters, latency, or GPU cost at serving time. On MetaTool (199 tools, 4,287 queries), it improves NDCG@5 from 0.869 to 0.940; on ToolBench (2,413 APIs), from 0.834 to 0.848. We also evaluate two learned extensions: a 2,625-parameter MLP re-ranker and a 197K-parameter contrastive adapter. The MLP re-ranker matches or underperforms the baseline when outcome data is sparse relative to the tool set; the contrastive adapter provides comparable gains on MetaTool (NDCG@5: 0.931). All methods are evaluated on the same held-out 30% test split. The practical takeaway is to start with the zero-cost refinement and add learned components only when data density warrants it. All mechanisms run within single-digit millisecond CPU budgets.
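
For readers unfamiliar with the headline metric, the following is a minimal sketch of NDCG@5 under binary relevance (a tool is either a correct choice for the query or not). Binary relevance and the function name `ndcg_at_k` are assumptions made for illustration; the abstract does not state how relevance is graded.

```python
import math

def ndcg_at_k(ranked_tools, relevant_tools, k=5):
    """NDCG@k with binary relevance (assumed here): relevant = 1, otherwise 0.

    Assumes ranked_tools contains no duplicate tool names.
    """
    relevant = set(relevant_tools)
    # DCG: a relevant tool at 1-based rank r contributes 1 / log2(r + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, tool in enumerate(ranked_tools[:k], start=1)
        if tool in relevant
    )
    # Ideal DCG: every relevant tool placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

A score of 1.0 means every relevant tool appears at the top of the ranking; placing a single relevant tool at rank 2 instead of rank 1 lowers the score to about 0.63.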

Paper Structure (82 sections, 10 equations, 5 figures, 8 tables, 1 algorithm)

Figures (5)

  • Figure 1: (a) LLM-based tool selection requires a GPU-bound orchestrator in the request path (500–2,000 ms). (b) OATS performs selection on CPU in the router (3–7 ms), with all learning offline. LLM serving latency is the same in both cases.
  • Figure 2: The OATS pipeline. Top: the online serving path runs on CPU in 3–7 ms. Bottom: offline learning from outcome logs. The core mechanism (Stage 1, dashed arrow) refines tool embeddings in the Tool DB at zero serving cost. Stages 2 and 3 are ablation mechanisms that optionally update the Re-Rank and Embed components respectively.
  • Figure 3: Geometry of OATS-S1 for the buildbetter example. The original embedding (blue circle) sits in a generic "SaaS" region, far from the test query (star). Positive training queries $Q^+$ (teal dots) cluster around "meeting transcripts"; negative queries $Q^-$ (red dots) cluster around "call management." The refinement pulls the tool embedding toward $\bar{e}^+$ and away from $\bar{e}^-$, placing the refined embedding (teal circle) closer to the test query. The description text never changes.
  • Figure 4: Stage 1 convergence over iterations. Left: MetaTool. Right: ToolBench.
  • Figure 5: Selection performance across all methods. Left: MetaTool. Right: ToolBench.
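
The geometry described in the Figure 3 caption can be sketched as a single offline update: move the tool embedding toward the positive centroid $\bar{e}^+$ and away from the negative centroid $\bar{e}^-$, then renormalize. This is a rough illustration only. The exact update rule is not given in this listing, so the step sizes `alpha` and `beta`, the renormalization step, and all function names below are assumptions.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def l2_normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def refine_tool_embedding(tool_emb, pos_queries, neg_queries, alpha=0.3, beta=0.1):
    """Hypothetical one-step refinement (alpha and beta are assumed values).

    pos_queries: embeddings of queries where the tool succeeded (Q+).
    neg_queries: embeddings of queries where the tool failed (Q-).
    """
    refined = list(tool_emb)
    if pos_queries:
        pos_c = centroid(pos_queries)
        # Pull: step along the direction from the original embedding to e+.
        refined = [r + alpha * (p - t) for r, p, t in zip(refined, pos_c, tool_emb)]
    if neg_queries:
        neg_c = centroid(neg_queries)
        # Push: step against the direction from the original embedding to e-.
        refined = [r - beta * (n - t) for r, n, t in zip(refined, neg_c, tool_emb)]
    return l2_normalize(refined)
```

Because the update only rewrites vectors stored in the tool database, the serving path still performs the same nearest-neighbor lookup, which is consistent with the zero serving cost claimed for Stage 1.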

Theorems & Definitions (2)

  • Definition 1: Tool Selection as Retrieval
  • Definition 2: Tool Selection as Decision-Making