Efficient Function-as-a-Service for Large Language Models with TIDAL
Weihao Cui, Ziyi Xu, Han Zhao, Quan Chen, Zijun Li, Bingsheng He, Minyi Guo
TL;DR
TIDAL addresses GPU-side cold start in Function-as-a-Service (FaaS) for Large Language Models (LLMs) by automatically tracing fine-grained execution paths during model initialization and inference. From these traces it generates adaptive function templates that enable proactive code loading and adaptive state forking: model loading is overlapped with computation, and static components are reused while request-specific parts are initialized dynamically. The approach reduces cold-start latency by $1.79\times$ to $2.11\times$ and lowers the $95$th-percentile time-to-first-token (TTFT) by $76.0\%$ across multiple LLMs and workload regimes, including distributed tensor-parallel deployments. The work shows that trace-driven template adaptation is a practical way to achieve fast, robust GPU-backed LLM serving in a multi-tenant FaaS setting, while keeping overhead acceptable and taking security into account.
Abstract
Large Language Model (LLM) applications have emerged as a prominent use case for Function-as-a-Service (FaaS) due to their high computational demands and sporadic invocation patterns. However, serving LLM functions within FaaS frameworks incurs a significant GPU-side cold start. A fundamental approach is to use a template with function state saved on GPUs so that new invocations bypass the cold start. Yet this approach struggles with the high GPU footprint, dynamic initialization behaviors, and lazy GPU kernel loading inherent in LLM functions, primarily because it lacks insight into the underlying execution details. In this paper, we introduce TIDAL, an optimized FaaS framework for LLM applications that achieves fast startups by tracing fine-grained execution paths. Using the traced execution details, TIDAL generates adaptive function templates that break the startup barriers for LLM functions. Extensive evaluations demonstrate that TIDAL reduces cold-start latency by $1.79\times$ to $2.11\times$ and improves the $95$th-percentile time-to-first-token by $76.0\%$, surpassing state-of-the-art methods.
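
To make the idea of overlapping model loading with computation concrete, the following is a minimal Python sketch. It is not TIDAL's implementation: the helpers `load_layer` and `compute_layer` are hypothetical stand-ins that simulate weight transfer and a forward pass with short sleeps, and the layer-by-layer granularity is an assumption made only for illustration. The point is that computation on layer `i` can begin as soon as that layer's weights have arrived, rather than after the whole model has loaded.

```python
import threading
import time


def load_layer(i):
    """Hypothetical stand-in for copying one layer's weights to the GPU."""
    time.sleep(0.01)
    return f"weights[{i}]"


def compute_layer(i, weights, x):
    """Hypothetical stand-in for running one layer's forward pass."""
    time.sleep(0.01)
    return x + 1


def serve_sequential(num_layers, x):
    """Baseline: load every layer first, then run the forward pass."""
    weights = [load_layer(i) for i in range(num_layers)]
    for i in range(num_layers):
        x = compute_layer(i, weights[i], x)
    return x


def serve_overlapped(num_layers, x):
    """Run layer i while later layers are still loading in the background."""
    weights = [None] * num_layers
    ready = [threading.Event() for _ in range(num_layers)]

    def loader():
        for i in range(num_layers):
            weights[i] = load_layer(i)
            ready[i].set()  # signal that layer i can now execute

    thread = threading.Thread(target=loader)
    thread.start()
    for i in range(num_layers):
        ready[i].wait()  # block only until this layer's weights arrive
        x = compute_layer(i, weights[i], x)
    thread.join()
    return x
```

Both functions produce the same result; in the overlapped variant the compute loop waits only for the layer it needs next, so loading and execution proceed concurrently. On a real GPU the same pattern would typically rely on asynchronous copies and stream synchronization rather than Python threads.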
