Machine Learning as a Tool (MLAT): A Framework for Integrating Statistical ML Models as Callable Tools within LLM Agent Workflows
Edwin Chen, Zulekha Bibi
TL;DR
MLAT is a design pattern that exposes pre-trained statistical ML models as callable tools within LLM agent workflows, so that an agent can invoke them when the context calls for it and reason about their predictions. The approach is validated by PitchCraft, a two-agent system that uses an XGBoost pricing tool to generate professional proposals. The pricing model achieves $R^2 = 0.807$ on held-out data and a commercially meaningful relative MAE of about 22.6%, despite being trained on only 70 examples that include synthetic augmentation. The framework relies on a structured output architecture with Gemini JSON schemas to bridge LLM reasoning and ML inputs. In use, PitchCraft reduces proposal generation time from multiple hours to under 10 minutes. MLAT generalizes to domains that require quantitative estimation plus contextual reasoning, offering a practical and upgradeable way to integrate statistical models into conversational AI workflows.
Abstract
We introduce Machine Learning as a Tool (MLAT), a design pattern in which pre-trained statistical machine learning models are exposed as callable tools within large language model (LLM) agent workflows. This allows an orchestrating agent to invoke quantitative predictions when needed and reason about their outputs in context. Unlike conventional pipelines that treat ML inference as a static preprocessing step, MLAT positions the model as a first-class tool alongside web search, database queries, and APIs, enabling the LLM to decide when and how to use it based on conversational context. To validate MLAT, we present PitchCraft, a pilot production system that converts discovery call recordings into professional proposals with ML-predicted pricing. The system uses two agents: a Research Agent that gathers prospect intelligence via parallel tool calls, and a Draft Agent that invokes an XGBoost pricing model through a tool call and generates a complete proposal through structured outputs. The pricing model, trained on 70 examples combining real and human-verified synthetic data, achieves $R^2 = 0.807$ on held-out data with a mean absolute error of 3,688 USD. The system reduces proposal generation time from multiple hours to under 10 minutes. We describe the MLAT framework, the structured output architecture, the training methodology under extreme data scarcity, and a sensitivity analysis showing that the model has learned meaningful relationships. MLAT generalizes to domains requiring quantitative estimation combined with contextual reasoning.
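To make the pattern concrete, the following is a minimal sketch of how a statistical model might be exposed as a schema-described tool that an agent can call. It is not PitchCraft's implementation. The feature names, the `predict_price` tool name, the `dispatch` helper, and the pricing formula are all hypothetical, and `stub_pricing_model` is a hand-written stand-in for a trained regressor such as an XGBoost model. Only the Python standard library is used.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical input schema. The real pricing features are not listed here,
# so these three fields are illustrative assumptions only.
PRICING_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "estimated_hours": {"type": "number"},
        "num_integrations": {"type": "integer"},
        "is_rush": {"type": "boolean"},
    },
    "required": ["estimated_hours", "num_integrations", "is_rush"],
}

_JSON_TYPES = {"number": (int, float), "integer": (int,), "boolean": (bool,)}


def validate(args: Dict[str, Any], schema: Dict[str, Any]) -> None:
    """Check LLM-produced arguments against the schema before they reach the model."""
    for name in schema["required"]:
        if name not in args:
            raise ValueError(f"missing required field: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            raise ValueError(f"unexpected field: {name}")
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and spec["type"] != "boolean":
            raise ValueError(f"field {name} must be {spec['type']}")
        if not isinstance(value, _JSON_TYPES[spec["type"]]):
            raise ValueError(f"field {name} must be {spec['type']}")


def stub_pricing_model(features: Dict[str, Any]) -> float:
    """Stand-in for a trained regressor; the coefficients are invented."""
    price = 1000.0 + 90.0 * features["estimated_hours"]
    price += 750.0 * features["num_integrations"]
    if features["is_rush"]:
        price *= 1.25
    return round(price, 2)


@dataclass(frozen=True)
class Tool:
    name: str
    description: str  # shown to the LLM so it can decide when to call the tool
    parameters: Dict[str, Any]
    fn: Callable[[Dict[str, Any]], float]


TOOLS = {
    "predict_price": Tool(
        name="predict_price",
        description="Estimate a project price in USD from structured features.",
        parameters=PRICING_SCHEMA,
        fn=stub_pricing_model,
    )
}


def dispatch(tool_call_json: str) -> str:
    """Run one tool call emitted by the agent and return a JSON result for it."""
    call = json.loads(tool_call_json)
    tool = TOOLS[call["name"]]
    validate(call["arguments"], tool.parameters)
    prediction = tool.fn(call["arguments"])
    return json.dumps({"tool": tool.name, "predicted_price_usd": prediction})


# Example: the structured call an agent might emit after reading a transcript.
example_call = json.dumps({
    "name": "predict_price",
    "arguments": {"estimated_hours": 40, "num_integrations": 2, "is_rush": False},
})
print(dispatch(example_call))
```

The schema plays two roles in this sketch: it tells the LLM which arguments to produce, and it lets the dispatcher reject malformed input before inference. Because the model sits behind `Tool.fn`, it could be retrained or swapped without changing the agent-facing interface.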
