Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution
Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo
TL;DR
Conveyor tackles the latency of tool-assisted LLM serving by enabling tool partial execution that runs concurrently with LLM decoding. It introduces a token-level scheduler and a tool interface through which tools expose partial-execution opportunities, so that tool work is pipelined with decoding while tool processes are reached over dedicated IPC. Empirical results across six tool-augmented workloads show latency reductions of up to 38.8%, with some cases achieving much larger improvements via early aborts, and a theoretical framework explains when overlap is most beneficial. The work demonstrates practical, low-overhead integration for tool-enabled LLM services and highlights the benefit of co-optimizing decoding and external tool execution for responsive AI applications.
Abstract
The complexity of large language model (LLM) serving workloads has substantially increased with the integration of external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can reduce request completion latency by up to 38.8%.
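
To make the idea of tool partial execution concrete, the following is a minimal Python sketch, not Conveyor's actual interface. It assumes a hypothetical tool whose input can be executed one line at a time, so each line can run as soon as the decoder has emitted it. The names `LinewiseTool`, `feed`, `finish`, and `serve` are invented for illustration, and the "execution" is a stand-in that only records the line.

```python
from typing import List


class LinewiseTool:
    """Hypothetical tool that can start work on each complete line of input."""

    def __init__(self) -> None:
        self._buffer = ""               # text received but not yet executable
        self.executed: List[str] = []   # lines already executed, in order

    def _run(self, line: str) -> None:
        # Stand-in for real work (for example, running one line of code).
        self.executed.append(line)

    def feed(self, token: str) -> None:
        """Called by the serving loop after each decoded token."""
        self._buffer += token
        # Execute every complete line now, while decoding continues.
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            if line.strip():
                self._run(line)

    def finish(self) -> List[str]:
        """Called once the tool input is fully decoded; runs leftover text."""
        if self._buffer.strip():
            self._run(self._buffer)
        self._buffer = ""
        return self.executed


def serve(tokens: List[str], tool: LinewiseTool) -> List[int]:
    """Toy token-level loop: records how many lines had run after each token."""
    progress = []
    for tok in tokens:
        tool.feed(tok)
        progress.append(len(tool.executed))
    tool.finish()
    return progress


tool = LinewiseTool()
progress = serve(["import", " os", "\n", "print", "(1)", "\n", "x = 2"], tool)
```

In this toy run, two of the three lines have already been executed before the last token is decoded, so only the final line remains to run after decoding ends. A real system would run the tool in a separate process and overlap the work in time; this sketch only shows the ordering.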
