Dirigent: Lightweight Serverless Orchestration

Lazar Cvetković, François Costa, Mihajlo Djokic, Michal Friedman, Ana Klimovic

TL;DR

Dirigent is a clean-slate system architecture for FaaS orchestration built on three key principles: it optimizes internal cluster manager abstractions to simplify state management, eliminates persistent state updates on the critical path of function invocations, and runs monolithic control and data planes to minimize internal communication overheads and maximize throughput.

Abstract

While Function as a Service (FaaS) platforms can initialize function sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule functions in real FaaS clusters can be orders of magnitude higher. The current approach of building FaaS cluster managers on top of legacy orchestration systems (e.g., Kubernetes) leads to high scheduling delays when clusters experience high sandbox churn, which is common for FaaS. Generic cluster managers use many hierarchical abstractions and internal components to manage and reconcile cluster state with frequent persistent updates. This becomes a bottleneck for FaaS since the cluster state frequently changes as sandboxes are created on the critical path of requests. Based on our root cause analysis of performance issues in existing FaaS cluster managers, we propose Dirigent, a clean-slate system architecture for FaaS orchestration with three key principles. First, Dirigent optimizes internal cluster manager abstractions to simplify state management. Second, it eliminates persistent state updates on the critical path of function invocations, leveraging the fact that FaaS abstracts sandbox locations from users to relax exact state reconstruction guarantees. Finally, Dirigent runs monolithic control and data planes to minimize internal communication overheads and maximize throughput. We compare Dirigent to state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile per-function scheduling latency for a production workload by 2.79x compared to AWS Lambda. Dirigent can spin up 2500 sandboxes per second at low latency, which is 1250x more than Knative.
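
The second principle can be illustrated with a small sketch. It is not taken from Dirigent's code base: the class `ControlPlaneSketch`, its methods, the JSON file used as a durable store, and the recovery step in which workers report their running sandboxes are all assumptions made for illustration. The sketch only shows the general idea that rare, user-visible state (function registrations) is persisted, while sandbox state created on the critical path stays in memory and is rebuilt approximately after a restart.

```python
import json
import os
import tempfile


class ControlPlaneSketch:
    """Hypothetical control plane; names and storage format are assumptions."""

    def __init__(self, store_path):
        self.store_path = store_path
        self.functions = {}      # durable: function name -> image
        self.sandboxes = {}      # volatile: sandbox id -> (function, worker)
        self.persist_count = 0   # number of durable writes issued
        if os.path.exists(store_path):
            with open(store_path) as f:
                self.functions = json.load(f)

    def register_function(self, name, image):
        # Off the critical path of invocations: persist the registration.
        self.functions[name] = image
        with open(self.store_path, "w") as f:
            json.dump(self.functions, f)
        self.persist_count += 1

    def create_sandbox(self, sandbox_id, function, worker):
        # On the critical path of a cold invocation: in-memory update only.
        if function not in self.functions:
            raise KeyError(function)
        self.sandboxes[sandbox_id] = (function, worker)

    def recover_sandboxes(self, worker_reports):
        # Assumed recovery path: workers report the sandboxes they still run,
        # so exact pre-crash state does not need to be reconstructed from disk.
        for worker, running in worker_reports.items():
            for sandbox_id, function in running:
                self.sandboxes[sandbox_id] = (function, worker)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "functions.json")
    cp = ControlPlaneSketch(path)
    cp.register_function("hello", "example/hello:latest")
    for i in range(1000):
        cp.create_sandbox(f"sb-{i}", "hello", f"worker-{i % 4}")
    writes = cp.persist_count  # sandbox creation added no durable writes

    # Simulate a control plane restart followed by a worker report.
    restarted = ControlPlaneSketch(path)
    restarted.recover_sandboxes({"worker-0": [("sb-0", "hello")]})
    return writes, restarted.functions, restarted.sandboxes
```

In this sketch, creating 1000 sandboxes issues no durable writes beyond the single function registration, and a restarted instance recovers the registration from disk and the sandbox placement from the (hypothetical) worker report.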

Paper Structure (24 sections, 11 figures, 3 tables)

Figures (11)

  • Figure 1: End-to-end latency breakdown of cold invocation bursts in Knative. Sandbox creation involves sequentially creating two containers: user-code container and its sidecar. Sandbox init is the time it takes to pass health probes.
  • Figure 2: AWS Lambda end-to-end latency CDFs with different cold start bursts of hello-world functions. We pre-cache container images, based on insights from Brooker et al. [brooker:firecracker_snapshots].
  • Figure 3: Rate of sandbox creation over time in a 30-minute window (after 10-min warmup) of the 70K function Azure trace [shahrad:serverless], simulated on a 1000 worker-node cluster with default Knative scheduling policies. Each sandbox processes 1 request at a time, the default for FaaS platforms [aws:sandbox_concurrency, gcf:invocation_level_guarantees].
  • Figure 4: Knative system architecture, which builds on K8s. This diagram is simplified, showing only key components which all run as independent microservices. K8s components are blue, while yellow components are added by Knative.
  • Figure 5: CDF of per-invocation scheduling latency and per-function mean scheduling latency when executing 500-function Azure trace [ustiugov:in_vitro, shahrad:serverless] on a 93-worker cluster.
  • ...and 6 more figures