ScaleCall -- Agentic Tool Calling at Scale for Fintech: Challenges, Methods, and Deployment Insights

Richard Osuagwu, Thomas Cook, Maraim Masoud, Koustav Ghosal, Riccardo Mattivi

TL;DR

Tool calling by LLMs in regulated fintech environments faces challenges from on-premises constraints, strict compliance, and overlapping internal toolsets. The authors present ScaleCall, a prototype framework that benchmarks embedding-based retrieval, prompt-based listwise ranking, and hybrids, employing a domain-aware evaluation strategy. Key findings show that retrieval performance hinges on domain factors: embedding-based methods deliver lower latency for large tool repositories, listwise prompting improves disambiguation for overlapping tools, and hybrid approaches offer context-specific benefits; prompt enrichment markedly boosts performance. The study provides practical deployment insights for fintech, outlining a roadmap for robust, compliant tool-calling systems in regulated industries and informing design choices for enterprise AI orchestration.

Abstract

While Large Language Models (LLMs) excel at tool calling, deploying these capabilities in regulated enterprise environments such as fintech presents unique challenges due to on-premises constraints, regulatory compliance requirements, and the need to disambiguate large, functionally overlapping toolsets. In this paper, we present a comprehensive study of tool retrieval methods for enterprise environments through the development and deployment of ScaleCall, a prototype tool-calling framework within Mastercard designed for orchestrating internal APIs and automating data engineering workflows. We systematically evaluate embedding-based retrieval, prompt-based listwise ranking, and hybrid approaches, revealing that method effectiveness depends heavily on domain-specific factors rather than inherent algorithmic superiority. Through empirical investigation on enterprise-derived benchmarks, we find that embedding-based methods offer superior latency for large tool repositories, while listwise ranking provides better disambiguation for overlapping functionalities, with hybrid approaches showing promise in specific contexts. We integrate our findings into ScaleCall's flexible architecture and validate the framework through real-world deployment in Mastercard's regulated environment. Our work provides practical insights into the trade-offs between retrieval accuracy, computational efficiency, and operational requirements, contributing to the understanding of tool-calling system design for enterprise applications in regulated industries.

Paper Structure

This paper contains 27 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Architecture of the Tool Retrieval Framework. The diagram illustrates both the baseline architecture, Embedding-based Tool Retriever (ETR), and the enhanced architecture incorporating the re-ranking module, referred to as Re-ranked Tool Retriever (RTR). Components highlighted in yellow indicate shared modules utilized by both approaches. Flow (1) corresponds to the ETR pipeline, which generates an ordered list of tools based on embedding similarity scores. Flow (2) represents the extension introduced in RTR, where a large language model (LLM) re-ranker refines the initial ranking. The symbol $\bm{\otimes}$ denotes the embedding similarity operation, such as cosine similarity or dot product.
  • Figure 2: A comparative illustration of two re-ranking architectures. The left side of the diagram (white background) depicts the shared initial embedding stage, used by both of our systems (ETR and RTR) and by the original ToolRet framework [shi2025retsavvy], which retrieves the top-$k$ candidate tools. The two designs diverge in the re-ranking phase: ToolRet applies a Pointwise Cross-Encoder (top-right, green), whereas our proposed method employs a Generative Listwise Re-ranker (bottom-right, blue). The symbol $\bm{\otimes}$ indicates the point at which the user query is combined with the candidate tools for the re-ranking operation.
  • Figure 3: High-level Diagram of ScaleCall Architecture
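
The two flows described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ScaleCall's implementation: the tool names, the two-dimensional toy embeddings, and the `rerank` callable (a stand-in for the LLM listwise re-ranker) are all hypothetical, and cosine similarity is used as one possible choice for the $\otimes$ operation.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def etr(query_vec: Vector, tool_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Flow (1): rank tools by embedding similarity and keep the top k."""
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def rtr(
    query: str,
    query_vec: Vector,
    tool_vecs: Dict[str, Vector],
    k: int,
    rerank: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Flow (2): let a re-ranker reorder the whole top-k candidate list."""
    candidates = etr(query_vec, tool_vecs, k)
    reordered = rerank(query, candidates)
    # Guard against a re-ranker that invents, repeats, or drops tools:
    # keep known candidates once each, then append any that were omitted.
    kept = [t for t in dict.fromkeys(reordered) if t in candidates]
    return kept + [t for t in candidates if t not in kept]


# Hypothetical tools with toy 2-D embeddings (assumed values for illustration).
TOOLS = {
    "get_fx_rate": [1.0, 0.0],
    "get_card_limit": [0.0, 1.0],
    "convert_currency": [0.9, 0.1],
}
QUERY_VEC = [0.8, 0.2]


def reverse_reranker(query: str, candidates: List[str]) -> List[str]:
    """Placeholder for an LLM call; it simply reverses the candidate order."""
    return list(reversed(candidates))


etr_result = etr(QUERY_VEC, TOOLS, k=2)
rtr_result = rtr("convert 100 USD to EUR", QUERY_VEC, TOOLS, 2, reverse_reranker)
```

The re-ranker receives the full candidate list in one call, which is what distinguishes the listwise setup from a pointwise scorer that would be invoked once per (query, tool) pair. The guard in `rtr` reflects a practical concern with generative re-rankers, whose output may not be a clean permutation of the input.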