Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution

Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo

TL;DR

Conveyor reduces the latency of tool-assisted LLM serving by letting tools execute partially while the LLM is still decoding. It introduces a token-level scheduler and a tool interface through which tools expose partial-execution opportunities, so that tool execution is pipelined with decoding while tool processes communicate over dedicated IPC. Empirical results across six tool-augmented workloads show latency reductions of up to 38.8%, with some cases achieving much larger improvements via early aborts. A theoretical framework explains when overlapping tool execution with decoding is most beneficial. The work demonstrates a practical, low-overhead integration for tool-enabled LLM services and shows the value of co-optimizing decoding and external tool execution for responsive AI applications.

Abstract

The complexity of large language model (LLM) serving workloads has substantially increased due to the integration with external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can improve request completion latency by up to 38.8%.
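
To make the idea of tool partial execution concrete, the following is a minimal sketch and not Conveyor's actual interface. It assumes a hypothetical tool class, `LineStreamingPythonTool`, with hypothetical `feed` and `finish` methods, and it assumes that every generated line is a standalone Python statement. The tool runs each line as soon as its last token has been decoded, so only the final line remains to be executed when decoding ends. For simplicity the sketch runs in a single thread; a real system would presumably run the tool in a separate process so that execution truly overlaps with decoding.

```python
from typing import Any, Dict, Iterable, List


class LineStreamingPythonTool:
    """Hypothetical tool that runs each complete line as soon as it is decoded."""

    def __init__(self) -> None:
        self._buffer = ""                     # tokens of the line still being decoded
        self.namespace: Dict[str, Any] = {}   # state shared across executed lines
        self.executed: List[str] = []         # lines already run, in order

    def feed(self, token: str) -> None:
        """Called once per decoded token; runs any lines that are now complete."""
        self._buffer += token
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            self._run(line)

    def finish(self) -> Dict[str, Any]:
        """Called when the tool call ends; runs the trailing line, if any."""
        if self._buffer.strip():
            self._run(self._buffer)
        self._buffer = ""
        return self.namespace

    def _run(self, line: str) -> None:
        if line.strip():
            # Assumption: every line is a standalone statement.
            exec(line, self.namespace)
            self.executed.append(line)


def serve(tokens: Iterable[str], tool: LineStreamingPythonTool) -> Dict[str, Any]:
    """Stand-in for a decoding loop that forwards each token to the tool."""
    for token in tokens:
        tool.feed(token)
    return tool.finish()


if __name__ == "__main__":
    tool = LineStreamingPythonTool()
    result = serve(["x", " = ", "2\n", "y = x", " * 21", "\n", "z = y + 1"], tool)
    print(tool.executed, result["z"])
```

In this toy run, the first two lines have already been executed by the time the last token arrives, which is the kind of overlap between decoding and tool execution that the abstract describes.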

Paper Structure

This paper contains 20 sections, 2 equations, and 8 figures.

Figures (8)

  • Figure 1: An example of tool-assisted LLM serving scenarios with and without tool partial execution optimization. This example includes three rounds of LLM inference (blue and green blocks) and two rounds of tool invocation (gray blocks).
  • Figure 2: A tool-assisted LLM serving scenario.
  • Figure 3: Python code generated by the LLM.
  • Figure 4: Case #1: Execution timeline for the CodeGen workload with and without partial execution. The numbers in the diagram are line numbers in the referenced Python code listing (label: code:eval-sra-python). The length of each block indicates relative execution time but does not correspond to the exact duration, because of the diagram's limited expressiveness.
  • Figure 5: Conveyor workflow overview.
  • ...and 3 more figures