Machine Learning as a Tool (MLAT): A Framework for Integrating Statistical ML Models as Callable Tools within LLM Agent Workflows

Edwin Chen, Zulekha Bibi

TL;DR

MLAT is a design pattern that exposes pre-trained statistical ML models as callable tools within LLM agent workflows, so that the agent can decide when to invoke a model and reason about its predictions in context. The approach is validated by PitchCraft, a two-agent system that uses an XGBoost pricing tool to generate professional proposals. The pricing model achieves $R^2 = 0.807$ on held-out data with a commercially meaningful relative MAE of about 22.6%, despite a dataset of only $N = 70$ examples that combines real and human-verified synthetic data. The framework relies on a structured output architecture with Gemini JSON schemas to bridge LLM reasoning and ML inputs, and the system cuts proposal generation time from multiple hours to under 10 minutes. MLAT generalizes to domains requiring quantitative estimation plus contextual reasoning, offering a practical and upgradeable approach to integrating statistical models into conversational AI workflows.

Abstract

We introduce Machine Learning as a Tool (MLAT), a design pattern in which pre-trained statistical machine learning models are exposed as callable tools within large language model (LLM) agent workflows. This allows an orchestrating agent to invoke quantitative predictions when needed and reason about their outputs in context. Unlike conventional pipelines that treat ML inference as a static preprocessing step, MLAT positions the model as a first-class tool alongside web search, database queries, and APIs, enabling the LLM to decide when and how to use it based on conversational context. To validate MLAT, we present PitchCraft, a pilot production system that converts discovery call recordings into professional proposals with ML-predicted pricing. The system uses two agents: a Research Agent that gathers prospect intelligence via parallel tool calls, and a Draft Agent that invokes an XGBoost pricing model as a tool call and generates a complete proposal through structured outputs. The pricing model, trained on 70 examples combining real and human-verified synthetic data, achieves $R^2 = 0.807$ on held-out data with a mean absolute error of 3688 USD. The system reduces proposal generation time from multiple hours to under 10 minutes. We describe the MLAT framework, structured output architecture, training methodology under extreme data scarcity, and sensitivity analysis demonstrating meaningful learned relationships. MLAT generalizes to domains requiring quantitative estimation combined with contextual reasoning.
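
To make the pattern concrete, the following is a minimal sketch of how a predictive model might be exposed as a callable tool with a JSON schema, under several assumptions. The feature names (`estimated_hours`, `num_integrations`), the function names, and the toy linear pricing rule are all hypothetical and stand in for the trained XGBoost model and the Gemini tool-calling layer, neither of which is reproduced here. The agent's tool call is simulated as a JSON string.

```python
import json
from typing import Any, Callable, Dict


def predict_price(features: Dict[str, float]) -> float:
    """Hypothetical stand-in for the trained pricing model.

    A toy linear rule, NOT the paper's XGBoost model; it only makes
    the example runnable without external dependencies.
    """
    return (
        2000.0
        + 150.0 * features["estimated_hours"]
        + 500.0 * features["num_integrations"]
    )


# Tool description the orchestrating LLM would see. The field names are
# assumptions for illustration, not the paper's actual input schema.
PRICING_TOOL_SCHEMA: Dict[str, Any] = {
    "name": "predict_price",
    "description": "Predict a project price in USD from structured features.",
    "parameters": {
        "type": "object",
        "properties": {
            "estimated_hours": {"type": "number"},
            "num_integrations": {"type": "number"},
        },
        "required": ["estimated_hours", "num_integrations"],
    },
}

# Registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[[Dict[str, float]], float]] = {
    "predict_price": predict_price,
}


def dispatch_tool_call(call_json: str) -> Dict[str, Any]:
    """Validate a simulated tool call and run the matching model."""
    call = json.loads(call_json)
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    required = PRICING_TOOL_SCHEMA["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        return {"error": f"missing fields: {missing}"}
    # The result is returned to the agent, which reasons about it in context.
    return {"tool": name, "prediction_usd": TOOLS[name](args)}


if __name__ == "__main__":
    simulated_call = json.dumps(
        {
            "name": "predict_price",
            "arguments": {"estimated_hours": 40, "num_integrations": 2},
        }
    )
    print(dispatch_tool_call(simulated_call))
```

In this sketch the model sits behind the same kind of interface as any other tool: the agent supplies structured arguments, and the numeric prediction comes back as data the agent can interpret rather than as a fixed preprocessing output.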

Paper Structure

This paper contains 48 sections, 2 equations, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: PitchCraft architecture: a single LLM workflow with two Gemini agents. Agent 1 (Research Agent) analyzes the Fireflies transcript and performs parallel tool calls to Firecrawl and Perplexity APIs, outputting structured JSON. Agent 2 (Draft Agent) extracts ML features into the model's input schema, invokes the XGBoost pricing model as an MLAT tool call (green, below), reasons about the prediction, and generates the complete proposal via structured output parsing.
  • Figure 2: Target variable distribution across training and test sets. Left: Histogram showing the right-skewed price distribution. Training (blue) and test (orange) sets show comparable coverage. Right: Box plots confirming similar median and interquartile ranges ($n_{\text{train}}=56$, $n_{\text{test}}=14$).
  • Figure 3: Predicted vs. actual price. Left: Training set ($R^2 = 0.937$) shows tight clustering around the identity line. Right: Test set ($R^2 = 0.807$) demonstrates generalization to unseen data, with slight overestimation in the mid-range and underestimation at the highest values, consistent with the regression-to-the-mean behavior expected with limited tail data.
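
The $R^2$ and mean absolute error figures quoted in the abstract and captions are standard regression metrics. As a reminder of how they are computed, here is a minimal sketch using the usual definitions, $R^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$ and MAE as the mean of absolute residuals. The three price pairs are made up for illustration and are not taken from the paper's data.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Illustrative prices in USD; NOT the paper's data.
    actual = [10000.0, 20000.0, 30000.0]
    predicted = [12000.0, 19000.0, 27000.0]
    print("R^2:", r_squared(actual, predicted))
    print("MAE (USD):", mean_absolute_error(actual, predicted))
```

A gap between training and test $R^2$, like the one reported in Figure 3, is what these functions would show when evaluated separately on the two splits.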