ScaleCall -- Agentic Tool Calling at Scale for Fintech: Challenges, Methods, and Deployment Insights
Richard Osuagwu, Thomas Cook, Maraim Masoud, Koustav Ghosal, Riccardo Mattivi
TL;DR
Tool calling by LLMs in regulated fintech environments is complicated by on-premises constraints, strict compliance requirements, and large internal toolsets with overlapping functionality. The authors present ScaleCall, a prototype framework, and use it to benchmark embedding-based retrieval, prompt-based listwise ranking, and hybrid approaches under a domain-aware evaluation strategy. The key finding is that retrieval performance hinges on domain-specific factors: embedding-based methods deliver lower latency for large tool repositories, listwise prompting improves disambiguation among overlapping tools, and hybrid approaches offer context-specific benefits. Prompt enrichment also markedly boosts performance. The study offers practical deployment insights for fintech, outlines a roadmap for robust, compliant tool-calling systems in regulated industries, and informs design choices for enterprise AI orchestration.
Abstract
While Large Language Models (LLMs) excel at tool calling, deploying these capabilities in regulated enterprise environments such as fintech presents unique challenges due to on-premises constraints, regulatory compliance requirements, and the need to disambiguate large, functionally overlapping toolsets. In this paper, we present a comprehensive study of tool retrieval methods for enterprise environments through the development and deployment of ScaleCall, a prototype tool-calling framework within Mastercard designed for orchestrating internal APIs and automating data engineering workflows. We systematically evaluate embedding-based retrieval, prompt-based listwise ranking, and hybrid approaches, revealing that method effectiveness depends heavily on domain-specific factors rather than inherent algorithmic superiority. Through empirical investigation on enterprise-derived benchmarks, we find that embedding-based methods offer superior latency for large tool repositories, while listwise ranking provides better disambiguation for overlapping functionalities, with hybrid approaches showing promise in specific contexts. We integrate our findings into ScaleCall's flexible architecture and validate the framework through real-world deployment in Mastercard's regulated environment. Our work provides practical insights into the trade-offs between retrieval accuracy, computational efficiency, and operational requirements, contributing to the understanding of tool-calling system design for enterprise applications in regulated industries.
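To make the three families of methods concrete, the following is a minimal sketch of a hybrid pipeline: a cheap similarity stage shortlists candidate tools, and a listwise prompt asks a model to reorder the shortlist. It is an illustration under stated assumptions, not ScaleCall's implementation. The tool catalogue, the bag-of-words `embed` function (a stand-in for a real embedding model), the prompt wording, and the `llm` callable are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical tool catalogue; names and descriptions are invented for illustration.
TOOLS = {
    "get_card_balance": "return the current balance for a card account",
    "get_card_transactions": "list recent transactions for a card account",
    "refund_transaction": "issue a refund for a card transaction",
    "get_fx_rate": "return the foreign exchange rate between two currencies",
}

def embed(text):
    """Toy bag-of-words vector; an assumption standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, tools, k):
    """Stage 1 (embedding-based retrieval): return the k most similar tool names."""
    q = embed(query)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])), reverse=True)
    return ranked[:k]

def build_listwise_prompt(query, candidates, tools):
    """Stage 2 input: show the whole shortlist at once so the model can compare tools."""
    lines = [f"{i + 1}. {name}: {tools[name]}" for i, name in enumerate(candidates)]
    return (
        "Rank the tools below from most to least relevant to the request.\n"
        f"Request: {query}\n"
        + "\n".join(lines)
        + "\nAnswer with tool names only, one per line."
    )

def hybrid_rank(query, tools, k, llm):
    """Retrieve k candidates, then reorder them using the model's listwise reply.

    `llm` is any callable mapping a prompt string to a reply string (hypothetical).
    Names the model invents are ignored; candidates it omits keep retrieval order.
    """
    candidates = retrieve(query, tools, k)
    reply = llm(build_listwise_prompt(query, candidates, tools))
    mentioned = [line.strip() for line in reply.splitlines() if line.strip() in candidates]
    ordered = list(dict.fromkeys(mentioned))  # de-duplicate while preserving order
    return ordered + [name for name in candidates if name not in ordered]
```

Calling `retrieve` alone corresponds to the embedding-only setting, and passing the full catalogue to `build_listwise_prompt` corresponds to the listwise-only setting. The latency and disambiguation trade-off described in the abstract comes from how many tool descriptions must pass through the model.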
