Pie: A Programmable Serving System for Emerging LLM Applications

In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong

TL;DR

Pie rethinks LLM serving by separating the generation loop from application logic and exposing fine-grained, API-driven handlers that inferlets orchestrate within a WebAssembly sandbox. The approach enables explicit KV cache control, customizable generation procedures, and seamless integration of arbitrary computation and I/O inside the generation flow. Empirical results show Pie matches state-of-the-art performance on standard text completion tasks and delivers substantial latency and throughput improvements for agentic workflows and advanced generation strategies, thanks to application-specific optimizations and adaptive batching. This programmability matters in practice for diverse LLM applications: it offers a flexible, scalable backend that can support evolving AI workflows while remaining performant and extensible in open-source deployments.

Abstract

Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application and without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
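
To make the idea of a decomposed generation loop concrete, the following is a minimal toy sketch, not Pie's actual API. The handler names `embed_txt` and `forward` appear in the paper's figure captions, but their signatures here, the `ToyService` class, the `free` handler, the page identifier, and the fake logits are all assumptions made for illustration. The point is only that the prefill-decode loop, the stopping rule, and the cache lifetime live in the application's program rather than in the serving system.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a serving system; each method is one handler."""
    vocab_size: int = 8
    kv_cache: dict = field(default_factory=dict)  # page id -> cached token ids

    def embed_txt(self, token_ids):
        # Assumption: a token id acts as its own "embedding" in this toy.
        return list(token_ids)

    def forward(self, page, embeddings):
        # Append inputs to the caller-chosen KV page, then return fake logits
        # that put all mass on one deterministic next token.
        ctx = self.kv_cache.setdefault(page, [])
        ctx.extend(embeddings)
        nxt = (sum(ctx) + len(ctx)) % self.vocab_size
        return [1.0 if i == nxt else 0.0 for i in range(self.vocab_size)]

    def free(self, page):
        # Explicit cache control: the caller decides when a page is released.
        self.kv_cache.pop(page, None)


def greedy_inferlet(svc, prompt, max_new_tokens, stop_token=0):
    """Application-owned loop: prefill once, then decode token by token."""
    page = "req-0"  # hypothetical page identifier
    logits = svc.forward(page, svc.embed_txt(prompt))
    out = []
    for _ in range(max_new_tokens):
        token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        if token == stop_token:
            break
        out.append(token)
        logits = svc.forward(page, svc.embed_txt([token]))
    svc.free(page)
    return out
```

Swapping the greedy pick for another sampling rule, or keeping the page alive for a follow-up request instead of calling `free`, would change only the inferlet-like function and leave the toy service untouched, which is the separation the abstract describes.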

Paper Structure

This paper contains 27 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Current LLM serving systems conceptually follow a monolithic prefill-decode loop, batching prompts and applying global policies for KV cache. This design lacks flexibility to support application-specific logic.
  • Figure 2: Our proposed system, Pie, dismantles the sequential generation process into independent handlers, and delegates control to user-provided programs called inferlets.
  • Figure 3: Inferlet service workflow. The application layer executes inferlets that make API calls to the control layer, whose batch scheduler adaptively batches these calls and forwards them to the inference layer. Results are sent back to the control layer and then to the inferlets.
  • Figure 4: Batch scheduling example. A batch of two embed_txt API calls from command queues 3 and 4, or a batch of three forward API calls from queues 1 and 2, is eligible to be dispatched to the inference layer. Horizontal batching groups calls across different command queues, while vertical batching groups consecutive calls of the same type within the same queue if they do not conflict.
  • Figure 5: Implementation comparison of agentic workflows in vLLM/SGLang (left) vs. Pie (right).
  • ...and 6 more figures
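
The horizontal and vertical batching rule in the Figure 4 caption can be illustrated with a small sketch. This is a toy under stated assumptions, not the paper's scheduler: the `Call` type, the `resource` field, and the conflict rule (two calls conflict when they name the same resource) are hypothetical, and the queue contents in the usage example are made up so that the counts match the caption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    kind: str      # API name, e.g. "forward" or "embed_txt"
    resource: str  # hypothetical: the object this call touches


def vertical_run(queue):
    """Longest prefix of same-kind calls in one queue with no repeated resource."""
    run, seen = [], set()
    for call in queue:
        if run and call.kind != run[0].kind:
            break  # a different call type ends the run
        if call.resource in seen:
            break  # assumed conflict rule: same resource touched twice
        run.append(call)
        seen.add(call.resource)
    return run


def candidate_batches(queues):
    """Merge each queue's vertical run with same-kind runs from other queues."""
    batches = defaultdict(list)
    for qid in sorted(queues):
        run = vertical_run(queues[qid])
        if run:
            batches[run[0].kind].extend((qid, call) for call in run)
    return dict(batches)


# Made-up queue contents chosen to reproduce the counts in the caption.
queues = {
    1: [Call("forward", "kv-a"), Call("forward", "kv-b"), Call("embed_txt", "emb-a")],
    2: [Call("forward", "kv-c")],
    3: [Call("embed_txt", "emb-c")],
    4: [Call("embed_txt", "emb-d")],
}
batches = candidate_batches(queues)
```

Here `batches["forward"]` holds three calls (two from queue 1 via vertical batching, one from queue 2 via horizontal batching) and `batches["embed_txt"]` holds two calls from queues 3 and 4. The sketch only enumerates candidate batches; how a scheduler chooses which one to dispatch, and when, is outside what the caption states.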