Serve Programs, Not Prompts

In Gim, Lin Zhong

TL;DR

A new LLM serving system architecture that serves programs instead of prompts, allowing users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server.

Abstract

Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
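To make the idea of a program that drives token prediction and KV cache management more concrete, the following is a minimal toy sketch in Python. It is not Symphony's actual interface: the names `ToyKernel`, `kv_open`, `kv_fork`, `pred`, and `parallel_lip` are hypothetical, the "model" is a deterministic stand-in function, and `pred` returns a single token id where a real system would more plausibly return a distribution. The sketch only illustrates how a program could fill a shared prefix once, fork its KV cache, and generate several continuations from it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class KVFile:
    """Toy stand-in for a KV cache file: it only records the cached token ids."""
    tokens: List[int] = field(default_factory=list)


class ToyKernel:
    """Hypothetical server-side interface (an assumption, not Symphony's API)."""

    def __init__(self, model: Callable[[List[int]], int]):
        self.model = model
        self.files: Dict[str, KVFile] = {}
        self.computed = 0  # number of tokens pushed through the toy model

    def kv_open(self, name: str) -> None:
        # Create an empty KV cache file if it does not exist yet.
        self.files.setdefault(name, KVFile())

    def kv_fork(self, src: str, dst: str) -> None:
        # Copy the cached tokens of src so dst can diverge without recomputing them.
        self.files[dst] = KVFile(list(self.files[src].tokens))

    def pred(self, name: str, new_tokens: List[int]) -> int:
        # Append new tokens to the cache and return the toy model's next token.
        f = self.files[name]
        f.tokens.extend(new_tokens)
        self.computed += len(new_tokens)
        return self.model(f.tokens)


def parallel_lip(kernel: ToyKernel, prefix: List[int],
                 suffixes: List[List[int]], steps: int) -> List[List[int]]:
    """Toy program: one shared prefix, one forked branch per suffix."""
    kernel.kv_open("prefix")
    kernel.pred("prefix", prefix)  # the prefix is processed exactly once
    outputs = []
    for i, suffix in enumerate(suffixes):
        branch = f"branch{i}"
        kernel.kv_fork("prefix", branch)
        nxt = kernel.pred(branch, suffix)
        generated = []
        for _ in range(steps):
            generated.append(nxt)
            nxt = kernel.pred(branch, [nxt])
        outputs.append(generated)
    return outputs


def toy_model(tokens: List[int]) -> int:
    # Deterministic placeholder for a real LLM forward pass.
    return (sum(tokens) + 1) % 50
```

In this toy, a 4-token prefix shared by two branches is counted once in `kernel.computed`, whereas issuing each branch as an independent prompt would count it twice. The branches run sequentially here for simplicity; batching them on the GPU would be the serving system's job.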

Paper Structure

This paper contains 15 sections, 3 figures.

Figures (3)

  • Figure 1: Comparison with existing serving systems (top) and Symphony (bottom). Symphony serves as an operating system for user-defined inference programs.
  • Figure 2: Example program demonstrating parallel token generation with shared prefix KV cache.
  • Figure 3: Estimated performance benefits of prompt caching implemented via LIPs in Symphony. The figure shows normalized mean end-to-end latency per generated token and throughput using the Llama 13B model (Dubey et al., 2024) on an NVIDIA A100 GPU. Symphony enables application-specific LLM optimizations, such as caching frequently reused KV cache, without requiring modifications to the serving system design.
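
As a rough illustration of the prompt caching idea named in the Figure 3 caption, the sketch below shows application-level bookkeeping that a program could keep for reusable prefixes. Everything here is an assumption made for illustration: `PrefixCache`, `longest_prefix`, and `tokens_to_prefill` are hypothetical names, the cache stores opaque string handles rather than real KV tensors, and it says nothing about how the paper's measurements were obtained.

```python
from typing import Dict, List, Optional, Tuple


class PrefixCache:
    """Toy map from a token prefix to a handle for its (hypothetical) KV cache."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], str] = {}

    def put(self, tokens: List[int], handle: str) -> None:
        # Remember that the KV cache for exactly these tokens is available.
        self._store[tuple(tokens)] = handle

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[str]]:
        # Scan from the longest candidate prefix down to the shortest.
        for n in range(len(tokens), 0, -1):
            handle = self._store.get(tuple(tokens[:n]))
            if handle is not None:
                return n, handle
        return 0, None


def tokens_to_prefill(cache: PrefixCache, prompt: List[int]) -> int:
    """Number of prompt tokens that still need a forward pass after reuse."""
    hit_len, _ = cache.longest_prefix(prompt)
    return len(prompt) - hit_len
```

For example, after `cache.put([1, 2, 3], "system_prompt")`, a prompt `[1, 2, 3, 9, 9]` would need only its last two tokens prefilled, while an unrelated prompt would be processed in full. The linear scan is chosen for clarity; a trie or hashed blocks would be a more realistic choice for long prompts.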