SimpleTool: Parallel Decoding for Real-Time LLM Function Calling

Xiaoxin Shi, Jiaxin Wan, Linkang Dong, Wei Jiang, Yue Liu, Zengfeng Huang

TL;DR

SimpleTool introduces special tokens that serve a dual role: they compress low-entropy tokens and act as mode selectors that enable independent, parallel generation of the function name and arguments, bridging the gap between LLM function calling and latency-critical real-world deployment.

Abstract

LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of the function name and arguments. This synergistic design achieves a 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on a consumer-grade GPU, SimpleTool achieves 61.2 ms P50 latency, enabling 16 Hz real-time control at the 4B model scale and bridging the gap between LLM function calling and latency-critical real-world deployment.
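The relationship between the latency and control-frequency figures quoted in the abstract can be checked with a short calculation. The sketch below assumes that each function call must finish before the next control step begins, so the achievable rate is simply the reciprocal of the per-call latency; the function names are illustrative and do not come from the paper.

```python
def max_control_rate_hz(latency_ms: float) -> float:
    """Upper bound on calls per second when calls run back to back (assumed model)."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def latency_budget_ms(rate_hz: float) -> float:
    """Per-call time budget needed to sustain a given control frequency."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


# 61.2 ms is the P50 latency reported in the abstract.
print(round(max_control_rate_hz(61.2), 1))  # 16.3
# A 10 Hz controller leaves 100 ms per call.
print(latency_budget_ms(10.0))  # 100.0
```

Because 61.2 ms is a median (P50), roughly half of the calls would exceed it, so a deployment with a hard 16 Hz deadline would presumably need to look at tail latency as well.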

Paper Structure (67 sections, 7 equations, 4 figures, 18 tables, 1 algorithm)

Figures (4)

  • Figure 2: Overview of SimpleTool. Given an input prompt with tool definitions, parallel heads generate function name and arguments independently while sharing the prefix KV cache.
  • Figure 3: Token compression illustration. Baseline structured output contains approximately 30 tokens spanning three entropy levels. SimpleTool compresses low-entropy tokens into special markers, retaining only high-entropy values for generation.
  • Figure 4: Average accuracy comparison between Baseline and SimpleTool across benchmark groups (macro average). BFCL-v3 is evaluated on single-turn subsets; Mobile Actions parallel calls are converted to multi-turn format; "Others" combines SealTools, OpenFunc, and ToolAlpaca.
  • Figure 5: Inference speedup evaluation. (a)-(b) Stacked bars show baseline time partitioned into SimpleTool time (dark) and time saved (light); speedup ratios are labeled at the boundary. (a) Transformers backend on RTX 4090 (blue) and H100 (green). (b) vLLM backend with prefix caching. (c) Absolute latency with AWQ quantization on RTX 4090. (d) Latency scaling with the number of active heads (Qwen3-4B, AWQ on RTX 4090). All measurements use BFCL-v3 single-turn cases.
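The captions for Figures 2 and 3 describe two effects that combine: low-entropy tokens are removed from the generated sequence, and the remaining high-entropy values are produced by independent heads. A toy step-count model makes the interaction concrete. This is a minimal sketch and not the authors' implementation: the head names and per-head token counts are hypothetical, chosen only so that the baseline totals the roughly 30 tokens mentioned for Figure 3, and the model counts decoding steps rather than wall-clock time.

```python
from typing import Dict


def sequential_steps(head_tokens: Dict[str, int], low_entropy_tokens: int) -> int:
    """Autoregressive baseline: every token, including delimiters, costs one step."""
    return sum(head_tokens.values()) + low_entropy_tokens


def parallel_steps(head_tokens: Dict[str, int]) -> int:
    """Independent heads decoding side by side: bounded by the longest head."""
    return max(head_tokens.values())


# Hypothetical high-entropy token counts per head (not taken from the paper).
heads = {"function_name": 3, "arg_1": 4, "arg_2": 2}
# Hypothetical count of delimiters and parameter names in the baseline output.
low_entropy = 21

print(sequential_steps(heads, low_entropy))  # 30
print(parallel_steps(heads))                 # 4
```

The step ratio in this toy example is larger than the end-to-end speedups reported in the abstract, which is expected: prefill time and the parallelization overhead are not modeled here.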