Efficient Function-as-a-Service for Large Language Models with TIDAL

Weihao Cui, Ziyi Xu, Han Zhao, Quan Chen, Zijun Li, Bingsheng He, Minyi Guo

TL;DR

TIDAL addresses the GPU-side cold start of Large Language Model (LLM) functions in Function-as-a-Service (FaaS) by automatically tracing fine-grained execution paths during model initialization and inference. From these traces it generates adaptive function templates that enable proactive code loading and adaptive state forking: model loading overlaps with computation, static components are reused, and request-specific parts are initialized dynamically. TIDAL reduces cold-start latency by $1.79\times$ to $2.11\times$ and the $95\%$-ile time-to-first-token (TTFT) by $76.0\%$, across multiple LLMs and workload regimes, including distributed tensor-parallel deployments. The work shows that trace-driven template adaptation is a practical way to achieve fast, robust GPU-backed LLM serving in a multi-tenant FaaS setting, while keeping overhead acceptable and taking security into account.
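To make the benefit of overlapping model loading with computation concrete, the following is a minimal timing sketch, not TIDAL's implementation. It assumes a layer can start computing once its own weights are loaded and the previous layer has finished, and that weight loads run back to back on a separate copy path. The function names and the per-layer timings are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def sequential_latency(load_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Latency when all weights are loaded before any computation starts."""
    if len(load_ms) != len(compute_ms):
        raise ValueError("load_ms and compute_ms must have one entry per layer")
    return sum(load_ms) + sum(compute_ms)


def overlapped_latency(load_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Latency when layer i computes as soon as its weights are resident.

    Assumption (illustrative): loads are issued back to back, and layer i
    also waits for layer i-1 to finish computing.
    """
    if len(load_ms) != len(compute_ms):
        raise ValueError("load_ms and compute_ms must have one entry per layer")
    load_done = 0.0     # time at which the current layer's weights are loaded
    compute_done = 0.0  # time at which the current layer's compute finishes
    for load, compute in zip(load_ms, compute_ms):
        load_done += load
        compute_done = max(compute_done, load_done) + compute
    return compute_done


# Hypothetical per-layer timings in milliseconds (not measurements from the paper).
LOAD_MS = [10, 10, 10, 10]
COMPUTE_MS = [4, 4, 4, 4]

if __name__ == "__main__":
    print("sequential:", sequential_latency(LOAD_MS, COMPUTE_MS))
    print("overlapped:", overlapped_latency(LOAD_MS, COMPUTE_MS))
```

With these made-up numbers the sequential schedule takes 56 ms and the overlapped one 44 ms, because only the last layer's computation remains exposed after the final load. When loading dominates, as in this example, overlap can hide at most the compute time, which is why reusing already-loaded static state matters in addition to overlap.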

Abstract

Large Language Model (LLM) applications have emerged as a prominent use case for Function-as-a-Service (FaaS) due to their high computational demands and sporadic invocation patterns. However, serving LLM functions within FaaS frameworks suffers from a significant GPU-side cold start. A fundamental approach is to use a template with function state saved on GPUs to bypass the cold start for new invocations. Yet this approach struggles with the high GPU footprint, dynamic initialization behaviors, and lazy GPU kernel loading inherent in LLM functions, primarily due to a lack of insight into the underlying execution details. In this paper, we introduce TIDAL, an optimized FaaS framework for LLM applications that achieves fast startups by tracing fine-grained execution paths. By utilizing the traced execution details, TIDAL generates adaptive function templates, effectively breaking startup barriers for LLM functions. Extensive evaluations demonstrate that TIDAL reduces cold-start latency by $1.79\times$ to $2.11\times$ and improves the $95\%$-ile time-to-first-token by $76.0\%$, surpassing state-of-the-art methods.

Paper Structure

This paper contains 55 sections, 1 equation, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Cold-start invocation using Llama2-13B (Touvron et al.) on an Nvidia RTX A6000. The input length is 2k.
  • Figure 2: An example of an LLM function encapsulating Llama2-13B with two parts: initialization and handler.
  • Figure 3: Lifecycle of a cold-start invocation using Llama2-13B, highlighting the optimization targets of TIDAL.
  • Figure 4: Breakdown of GPU cold-start and fully warmed invocation latencies for two Llama-family models (Touvron et al.; Dubey et al.) with varied inputs. For instance, "13B-512" denotes a Llama model with 13 billion parameters evaluated using an input length of 512.
  • Figure 5: Strawman solution based on CPU-only template-start, implemented via CUDA IPC.
  • ...and 15 more figures