HyFunc: Accelerating LLM-based Function Calls for Agentic AI through Hybrid-Model Cascade and Dynamic Templating

Weibin Liao, Jian-guang Lou, Haoyi Xiong

TL;DR

HyFunc tackles high inference latency in LLM-based tool calls by targeting three redundancies: processing a large function library as context on every request, generating the full output sequence with a large model, and generating fixed syntactic boilerplate. It proposes a hybrid-model cascade in which a large model distills user intent into a single soft token, a lightweight retriever uses that token to select relevant functions, and a smaller, prefix-tuned model generates the final call, with dynamic templating injecting boilerplate on demand. Evaluated on BFCL, a benchmark unseen during training, the approach reaches an end-to-end latency of $0.828\mathrm{s}$ and $80.1\%$ accuracy with a $0.6\mathrm{B}$-parameter smaller model (LMS), outperforming many larger baselines and showing strong gains for compact models. Overall, HyFunc provides a practical, plug-and-play framework for fast, reliable agentic AI by separating reasoning from generation and by removing predictable syntax from generation at runtime.

Abstract

While agentic AI systems rely on LLMs to translate user intent into structured function calls, this process is fraught with computational redundancy, leading to high inference latency that hinders real-time applications. This paper identifies and addresses three key redundancies: (1) the redundant processing of a large library of function descriptions for every request; (2) the redundant use of a large, slow model to generate an entire, often predictable, token sequence; and (3) the redundant generation of fixed, boilerplate parameter syntax. We introduce HyFunc, a novel framework that systematically eliminates these inefficiencies. HyFunc employs a hybrid-model cascade where a large model distills user intent into a single "soft token." This token guides a lightweight retriever to select relevant functions and directs a smaller, prefix-tuned model to generate the final call, thus avoiding redundant context processing and full-sequence generation by the large model. To eliminate syntactic redundancy, our "dynamic templating" technique injects boilerplate parameter syntax on-the-fly within an extended vLLM engine. To avoid potential limitations in generalization, we evaluate HyFunc on an unseen benchmark dataset, BFCL. Experimental results demonstrate that HyFunc achieves an excellent balance between efficiency and performance. It achieves an inference latency of 0.828 seconds, outperforming all baseline models, and reaches a performance of 80.1%, surpassing all models with a comparable parameter scale. These results suggest that HyFunc offers a more efficient paradigm for agentic AI. Our code is publicly available at https://github.com/MrBlankness/HyFunc.
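The idea behind "dynamic templating" can be illustrated with a small sketch. This is not the authors' implementation, which the abstract places inside an extended vLLM engine; it is a toy under stated assumptions, where `render_call_with_template` and `toy_model` are hypothetical names and the "model" is a lookup table standing in for token generation. It shows only the principle: fixed syntax (function name, parameter names, separators) is appended directly, and the model is consulted only for parameter values.

```python
from typing import Callable, List


def render_call_with_template(
    function_name: str,
    parameter_names: List[str],
    generate_value: Callable[[str, str], str],
) -> str:
    """Build a call string, injecting boilerplate and generating only values.

    `generate_value(prefix, name)` is a hypothetical stand-in for the model:
    it receives the text produced so far and the parameter being filled.
    """
    text = f"{function_name}("
    for index, name in enumerate(parameter_names):
        # Boilerplate is injected directly, with no generation step.
        separator = "" if index == 0 else ", "
        text += f"{separator}{name}="
        # Only the parameter value comes from the (stand-in) model.
        text += generate_value(text, name)
    return text + ")"


def toy_model(prefix: str, name: str) -> str:
    """Hypothetical stand-in for the small model: a fixed lookup table."""
    values = {"city": '"Paris"', "unit": '"celsius"'}
    return values[name]


call = render_call_with_template("get_weather", ["city", "unit"], toy_model)
print(call)
```

In this toy, the stand-in model is invoked once per parameter, while everything else in the call string is copied from the function signature.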

Paper Structure

This paper contains 37 sections, 9 equations, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the key differences between the existing function-call paradigm and our proposed redundancy-reduced paradigm.
  • Figure 2: High-level architecture of the HyFunc framework, including Offline Preparation and Online Inference, showing the flow from user prompt to the final function call through the two models and retrieval step.
  • Figure 3: Illustration of the three core strategies in the HyFunc framework. (a) The LML performs a single forward pass to produce semantic embeddings: function embeddings are derived via mean pooling over their token hidden states, while the user's intent is distilled into the hidden state of the first generated "soft token". (b) A dual-encoder MLP-based retriever is trained with a contrastive loss to align soft token and function embeddings. During inference, it uses cosine similarity to efficiently select relevant functions. (c) The LMS is fine-tuned using the soft token as a continuous prompt. A projector maps the soft token from the LML's space to the LMS's space, guiding the smaller model to generate the final, structured function call with reduced context.
  • Figure 4: Performance improvement and time consumption reduction of Dynamic Templating on various backbone LLMs.
  • Figure 5: Case study of Dynamic Templating.
  • ...and 1 more figure
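
The embedding and retrieval steps described in the Figure 3 caption (mean pooling over function token states, then cosine-similarity selection against the soft-token embedding) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: the trained MLP retriever and the projector are omitted, the vectors are hand-written two-dimensional toys rather than model hidden states, and the names `mean_pool`, `cosine_similarity`, and `retrieve_top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def mean_pool(token_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token hidden states into a single function embedding."""
    count = len(token_states)
    dim = len(token_states[0])
    return [sum(state[d] for state in token_states) / count for d in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    intent_embedding: Sequence[float],
    function_embeddings: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the k function names most similar to the intent embedding."""
    ranked = sorted(
        function_embeddings,
        key=lambda name: cosine_similarity(intent_embedding, function_embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy token hidden states for three hypothetical functions.
function_token_states = {
    "get_weather": [[1.0, 0.0], [0.8, 0.2]],
    "book_flight": [[0.0, 1.0], [0.2, 0.8]],
    "send_email": [[-1.0, 0.0], [-0.8, -0.2]],
}
function_embeddings = {
    name: mean_pool(states) for name, states in function_token_states.items()
}

# Toy stand-in for the soft-token hidden state that encodes user intent.
soft_token = [1.0, 0.2]
selected = retrieve_top_k(soft_token, function_embeddings, k=2)
print(selected)
```

In the setup the caption describes, function embeddings are computed during offline preparation, so online retrieval reduces to similarity comparisons of this kind rather than re-reading every function description.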