Pie: A Programmable Serving System for Emerging LLM Applications
In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
TL;DR
Pie rethinks LLM serving by separating the generation loop from application logic. It exposes fine-grained, API-driven handlers, which user-provided programs called inferlets orchestrate inside a WebAssembly sandbox. This gives applications explicit KV cache control, customizable generation procedures, and the ability to run arbitrary computation and I/O inside the generation flow. Empirical results show that Pie matches state-of-the-art performance on standard text completion tasks and delivers substantial latency and throughput improvements for agentic workflows and advanced generation strategies, owing to application-specific optimizations and adaptive batching. In practice, this programmability gives diverse LLM applications a flexible, scalable, open-source backend that can support evolving AI workflows while remaining performant and extensible.
Abstract
Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to integrate computation and I/O seamlessly, entirely within the application and without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows that Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
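The division of labor described in the abstract can be illustrated with a small toy model. The Python sketch below is not Pie's API, and real inferlets run as WebAssembly programs rather than Python functions. All names here (ToyServer, alloc_kv, forward, free_kv, lookup_tool, agent_inferlet) are hypothetical, and the "model" is a deterministic stand-in. The sketch only shows the shape of the idea: the server exposes small handlers, and the application-side program owns the generation loop, the KV cache lifetime, and any computation or I/O interleaved with decoding.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyServer:
    """Hypothetical stand-in for a serving system with fine-grained handlers.

    The "KV cache" is a list of token ids per page, and the "model" is a
    deterministic function of the cached tokens. Nothing here is Pie's API.
    """

    vocab_size: int = 16
    kv_pages: Dict[int, List[int]] = field(default_factory=dict)
    next_page: int = 0

    def alloc_kv(self) -> int:
        """Allocate an empty cache page and return its handle."""
        page = self.next_page
        self.next_page += 1
        self.kv_pages[page] = []
        return page

    def free_kv(self, page: int) -> None:
        """Release a cache page; the caller decides when this happens."""
        del self.kv_pages[page]

    def forward(self, page: int, tokens: List[int]) -> int:
        """Append tokens to the page and return the next token id."""
        self.kv_pages[page].extend(tokens)
        return sum(self.kv_pages[page]) % self.vocab_size


def lookup_tool(token: int) -> List[int]:
    """Hypothetical external call made in the middle of generation."""
    return [token + 1, token + 2]


def agent_inferlet(server: ToyServer, prompt: List[int],
                   max_tokens: int, tool_token: int) -> List[int]:
    """Application-owned generation loop (an inferlet-style program).

    The loop, the stopping rule, and the tool call all live here, not in
    the server. Tool results are appended to the existing cache page, so
    the earlier context is not recomputed.
    """
    page = server.alloc_kv()
    output: List[int] = []
    pending = list(prompt)
    while len(output) < max_tokens:
        token = server.forward(page, pending)
        output.append(token)
        pending = [token]
        if token == tool_token:
            # Interleave arbitrary computation or I/O with decoding.
            pending += lookup_tool(token)
    server.free_kv(page)
    return output


if __name__ == "__main__":
    server = ToyServer()
    print(agent_inferlet(server, [1, 2, 3], max_tokens=4, tool_token=6))
```

In a monolithic serving loop, the tool call in this sketch would typically end the request and force the client to resubmit the extended prompt. Keeping the loop in the application is what lets it reuse the cache page across the call, which is the kind of application-specific optimization the abstract refers to.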
