Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Ashutosh Hathidara, Julien Yu, Sebastian Schreiber

TL;DR

The paper addresses enterprise tool calling by LLMs when near-duplicate tools compete for the same intent and user requests are underspecified. It introduces DiaFORGE, a disambiguation-centric three-stage pipeline consisting of synthetic data generation (UTC-Gen), supervised fine-tuning with reasoning traces, and dynamic evaluation via the DiaBENCH harness. On DiaBENCH, DiaFORGE-trained models improve tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. The authors release an open dataset of approximately 5,000 production-grade enterprise API specifications paired with disambiguation-focused dialogues to support reproducibility and further research. Ablation results highlight the importance of validation cascades, near-duplicate distractor sampling, and thinking traces for robust, enterprise-grade tool calling. The work offers both a benchmark and a data resource for developing disambiguation-aware agents.

Abstract

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning, with reasoning traces, of open-source models ranging from 3B to 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5,000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
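
To make the two failure modes named in the abstract concrete, the following is a minimal sketch of a pre-invocation check, not the paper's implementation. It assumes a simplified tool specification that lists only a name and its required parameters; the `ToolSpec` class, the `next_action` helper, and the tool names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical, simplified tool specification (assumption, not the paper's schema)."""
    name: str
    required: tuple[str, ...]


def missing_arguments(tool: ToolSpec, provided: dict) -> list[str]:
    """Return required parameters that the user has not yet supplied."""
    return [p for p in tool.required if provided.get(p) in (None, "")]


def next_action(candidates: list[ToolSpec], provided: dict) -> tuple[str, object]:
    """Decide whether to clarify or call, given the retrieved candidate tools.

    - Zero or several candidates remain: ask the user which tool is meant.
    - One candidate but required arguments are missing: ask for those arguments.
    - One candidate and all required arguments present: the call is safe to issue.
    """
    if len(candidates) != 1:
        return ("ask_tool", sorted(t.name for t in candidates))
    missing = missing_arguments(candidates[0], provided)
    if missing:
        return ("ask_args", missing)
    return ("call", candidates[0].name)


# Hypothetical near-duplicate tools retrieved for the same request.
po = ToolSpec("create_purchase_order", ("supplier_id", "amount"))
pr = ToolSpec("create_purchase_requisition", ("requester_id", "amount"))

step_1 = next_action([po, pr], {"amount": 100})
step_2 = next_action([po], {"amount": 100})
step_3 = next_action([po], {"amount": 100, "supplier_id": "S-1"})
```

In this sketch the assistant issues a call only in the third step; the first two steps each produce a clarifying question, which is the behavior the disambiguation-focused dialogues are described as training.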

Paper Structure

This paper contains 63 sections, 27 equations, 20 figures, 8 tables.

Figures (20)

  • Figure 1: A routine business query can retrieve multiple near-duplicate tools, illustrating the need for fine-grained disambiguation before tool invocation.
  • Figure 2: Data Generation Engine for Disambiguation-Centric Unified Tool-Calling Conversations (UTC-Gen)
  • Figure 3: Trade-offs among tool call-related metrics under Dynamic Evaluation. Marker size and color $\propto$ False-Positive Tool-call Rate (FTR). Models closer to the upper right are preferable; those in the lower left underperform across metrics.
  • Figure 4: DiaFORGE generated dialogue sample
  • Figure 5: Conversation length distribution: number of dialogue turns per sample.
  • ...and 15 more figures