
An LLM-Tool Compiler for Fused Parallel Function Calling

Simranjit Singh, Andreas Karatzas, Michael Fore, Iraklis Anagnostopoulos, Dimitrios Stamoulis

TL;DR

The paper tackles latency and cost challenges in multi-tool LLM Copilots by introducing the LLM-Tool Compiler, a GPT-driven module that fuses similar tool operations into unified tasks. This fusion, realized via a Fuser and an Executor, increases parallelization and reduces token usage and latency without modifying the underlying function-calling API or prompting schemes. Across a large geospatial benchmark, the approach yields up to 4–5× more parallel calls and up to ~40% token-cost reduction with ~12% latency improvement, while maintaining overall task performance within variance. The work also discusses deployment considerations, ablation insights, and future directions, including local execution and graph-based dependency modeling, to further enhance real-world Copilot efficiency.
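To make the Fuser/Executor split concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. The paper describes a GPT-driven Fuser that selectively merges similar tool operations. This sketch substitutes plain grouping by exact tool name. The helpers `fused_schema`, `fuse`, and `execute`, the `fused_` naming prefix, the JSON-Schema-style wrapper, and the stand-in tools `load_tile` and `count_objects` are all hypothetical. The Executor expands each fused call back into the original tool invocations and runs them concurrently with Python's standard `ThreadPoolExecutor`.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

# A planned tool call as (tool_name, keyword_arguments); hypothetical representation.
Call = tuple[str, dict[str, Any]]


def fused_schema(tool_name: str, params: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical: wrap one tool's JSON-Schema parameters in an array so a
    single 'fused' function can carry many invocations of the same tool."""
    return {
        "name": f"fused_{tool_name}",
        "parameters": {
            "type": "object",
            "properties": {"calls": {"type": "array", "items": params}},
            "required": ["calls"],
        },
    }


def fuse(calls: list[Call]) -> dict[str, list[dict[str, Any]]]:
    """Fuser stand-in: group calls by tool name. The paper's Fuser is
    GPT-driven; exact-name grouping is a simplifying assumption."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for name, args in calls:
        groups[name].append(args)
    return dict(groups)


def execute(fused: dict[str, list[dict[str, Any]]],
            tools: dict[str, Callable[..., Any]],
            max_workers: int = 8) -> dict[str, list[Any]]:
    """Executor stand-in: expand every fused call into the original tool
    invocations and submit all of them to a thread pool at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: [pool.submit(tools[name], **args) for args in arg_list]
                   for name, arg_list in fused.items()}
        # Results keep the order in which calls were fused within each group.
        return {name: [f.result() for f in fs] for name, fs in futures.items()}


# Illustrative stand-in tools (hypothetical, not from the paper).
def load_tile(tile_id: str) -> str:
    return f"tile:{tile_id}"


def count_objects(tile_id: str, label: str) -> int:
    return len(tile_id) + len(label)


tools = {"load_tile": load_tile, "count_objects": count_objects}
plan = [("load_tile", {"tile_id": "A1"}),
        ("count_objects", {"tile_id": "A1", "label": "ship"}),
        ("load_tile", {"tile_id": "B2"})]
fused = fuse(plan)
results = execute(fused, tools)
print(fused)
print(results)
```

Here three planned calls collapse into two fused functions (one per tool type), while the Executor still performs all three underlying invocations. The function-calling API itself is untouched: the LLM simply sees fewer, wider functions.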

Abstract

State-of-the-art sequential reasoning in Large Language Models (LLMs) has expanded the capabilities of Copilots beyond conversational tasks to complex function calling, managing thousands of API calls. However, the tendency of compositional prompting to segment tasks into multiple steps, each requiring a round-trip to the GPT APIs, leads to increased system latency and costs. Although recent advancements in parallel function calling have improved tool execution per API call, they may necessitate more detailed in-context instructions and task breakdown at the prompt level, resulting in higher engineering and production costs. Inspired by the hardware design principles of multiply-add (MAD) operations, which fuse multiple arithmetic operations into a single task from the compiler's perspective, we propose LLM-Tool Compiler, which selectively fuses similar types of tool operations under a single function at runtime, presenting them as a unified task to the LLM. This selective fusion inherently enhances parallelization and efficiency. Benchmarked on a large-scale Copilot platform, LLM-Tool Compiler achieves up to four times more parallel calls than existing methods, reducing token costs and latency by up to 40% and 12%, respectively.
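The abstract's cost argument is that every extra step costs another round trip to the LLM API. A toy cost model shows why reducing round trips cuts both tokens and latency. All parameters here are assumptions: a constant context resent on each round trip, a constant output size, a constant per-round-trip latency, and idealized fusion that needs exactly one round trip per distinct tool type. The names `CostModel`, `round_trips`, and `estimate` and every number in the example are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    context_tokens: int   # prompt + history resent on every LLM round trip (held constant here)
    output_tokens: int    # tokens the LLM emits per round trip (held constant here)
    round_trip_s: float   # wall-clock seconds per LLM round trip (held constant here)


def round_trips(tool_calls: list[str], fused: bool) -> int:
    """Sequential compositional prompting: one round trip per tool call.
    Idealized fusion (assumption): one round trip per distinct tool type."""
    return len(set(tool_calls)) if fused else len(tool_calls)


def estimate(tool_calls: list[str], model: CostModel, fused: bool) -> tuple[int, float]:
    """Return (total tokens, total LLM latency in seconds) under the toy model."""
    n = round_trips(tool_calls, fused)
    return n * (model.context_tokens + model.output_tokens), n * model.round_trip_s


calls = ["load_tile"] * 4 + ["count_objects"] * 4
model = CostModel(context_tokens=1000, output_tokens=50, round_trip_s=2.0)
sequential = estimate(calls, model, fused=False)
fused = estimate(calls, model, fused=True)
print(sequential, fused)
```

Only the direction of the effect is meaningful. The toy gap (8 vs. 2 round trips) is far larger than the reported savings of up to 40% in tokens and 12% in latency, because it ignores tool execution time, context growth across steps, and the fact that a real agent does not fuse every eligible call.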

Paper Structure

This paper contains 8 sections, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Distributions of the number of tool calls on GeoLLM-Engine-5k for the baseline vs. LLM-Tool Compiler, using GPT-4 with zero-shot ReAct prompting.
  • Figure 2: As a preliminary analysis, we examine whether a simple lookup-table (LUT)-based model can capture overall agent latency. The figure shows the profiled and predicted average runtime per task across the baselines considered in our experiments.
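The Figure 2 caption does not spell out the LUT model, so the following is one plausible reading, offered as a sketch rather than the authors' method. Each operation type (an LLM call, or a given tool) gets a profiled mean latency in a lookup table. A task is a list of rounds, each consisting of one LLM call plus the tool calls it issued. Tools within a round are assumed to run in parallel, so a round costs the LLM latency plus the slowest tool. The table values and the names `LUT`, `predict_task_latency`, and `predict_average` are hypothetical.

```python
from statistics import fmean

# Hypothetical LUT: profiled mean latency in seconds per operation type.
LUT = {"llm_call": 2.0, "load_tile": 0.5, "count_objects": 0.25}


def predict_task_latency(rounds: list[list[str]], lut: dict[str, float]) -> float:
    """Each round = one LLM call followed by the tool calls it issued.
    Tools within a round are assumed to run in parallel, so a round costs
    the LLM latency plus the latency of its slowest tool."""
    total = 0.0
    for tools in rounds:
        total += lut["llm_call"]
        if tools:
            total += max(lut[t] for t in tools)
    return total


def predict_average(tasks: list[list[list[str]]], lut: dict[str, float]) -> float:
    """Average predicted runtime per task, to compare against a profiled average."""
    return fmean(predict_task_latency(t, lut) for t in tasks)


sequential_task = [["load_tile"], ["load_tile"], ["count_objects"]]
parallel_task = [["load_tile", "load_tile", "count_objects"]]
print(predict_task_latency(sequential_task, LUT))
print(predict_task_latency(parallel_task, LUT))
print(predict_average([sequential_task, parallel_task], LUT))
```

Comparing `predict_average` against measured per-task runtimes for each baseline would give the kind of profiled-vs-predicted comparison the caption describes. A purely additive table like this cannot capture effects such as network jitter or context-length-dependent LLM latency.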