Serverless Cold Starts and Where to Find Them

Artjom Joosen, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Luke Darlow, Jianfeng Wang, Qiwen Deng, Adam Barker

TL;DR

This paper analyzes a month-long trace of 85 billion user requests and 11.9 million cold starts from Huawei's serverless cloud platform, and shows that the number, duration, and characteristics of cold starts have complex, multifaceted origins.

Abstract

This paper releases and analyzes a month-long trace of 85 billion user requests and 11.9 million cold starts from Huawei's serverless cloud platform. Our analysis spans workloads from five data centers. We focus on cold starts and provide a comprehensive examination of the underlying factors influencing the number and duration of cold starts. These factors include trigger types, request synchronicity, runtime languages, and function resource allocations. We investigate components of cold starts, including pod allocation time, code and dependency deployment time, and scheduling delays, and examine their relationships with runtime languages, trigger types, and resource allocation. We introduce the pod utility ratio to measure a pod's useful lifetime relative to its cold start time, giving a more complete picture of cold starts, and observe that some pods with long cold start times also have longer useful lifetimes. Our findings reveal the complexity and multifaceted origins of the number, duration, and characteristics of cold starts, driven by differences in trigger types, runtime languages, and function resource allocations. For example, cold starts in Region 1 take up to 7 seconds, dominated by dependency deployment time and scheduling. In Region 2, cold starts take up to 3 seconds and are dominated by pod allocation time. Based on this, we identify opportunities to reduce the number and duration of cold starts using strategies for multi-region scheduling. Finally, we suggest directions for future research to address these challenges and enhance the performance of serverless cloud platforms. Our datasets and code are available at https://github.com/sir-lab/data-release.
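
The pod utility ratio can be illustrated with a minimal sketch. The abstract describes it only as the pod's useful lifetime relative to its cold start time, so the formula below (useful lifetime divided by total cold start time, with the cold start taken as the sum of the three components named in the abstract) is an assumption. The `PodRecord` class, its field names, and the sample values are hypothetical and are not taken from the released dataset.

```python
from dataclasses import dataclass


@dataclass
class PodRecord:
    """Hypothetical per-pod record; field names are illustrative only."""
    pod_alloc_s: float        # pod allocation time, in seconds
    deploy_s: float           # code and dependency deployment time, in seconds
    scheduling_s: float       # scheduling delay, in seconds
    useful_lifetime_s: float  # time the pod stays alive after the cold start

    @property
    def cold_start_s(self) -> float:
        # Assumption: total cold start time is the sum of its components.
        return self.pod_alloc_s + self.deploy_s + self.scheduling_s


def pod_utility_ratio(pod: PodRecord) -> float:
    """Assumed definition: useful lifetime divided by cold start time."""
    if pod.cold_start_s <= 0:
        raise ValueError("cold start time must be positive")
    return pod.useful_lifetime_s / pod.cold_start_s


# Made-up values: a slow cold start followed by a long useful lifetime,
# and a fast cold start followed by a short one.
slow_but_long_lived = PodRecord(0.5, 4.0, 1.5, 3600.0)
fast_but_short_lived = PodRecord(0.25, 0.5, 0.25, 60.0)

print(pod_utility_ratio(slow_but_long_lived))   # 600.0
print(pod_utility_ratio(fast_but_short_lived))  # 60.0
```

Under this assumed definition, the pod with the 6-second cold start scores higher than the pod with the 1-second cold start, which is the kind of case the abstract points to when it notes that some pods with long cold starts have longer useful lifetimes.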

Paper Structure

This paper contains 35 sections, 17 figures, 1 table.

Figures (17)

  • Figure 1: Plot showing the number of requests, functions, and pods for all five regions.
  • Figure 2: Life cycle of a pod. The pod is taken from its resource pool, loaded with a runtime, function code, and dependencies. After serving requests, the pod waits for additional requests for a designated keep alive time. If the pod receives no more requests during this time, it is deleted.
  • Figure 3: CDF of invocations, execution time, and CPU usage for all regions.
  • Figure 4: Number of functions per user and number of requests per user for all regions over the duration of the trace.
  • Figure 5: Plot showing peaks in normalized number of requests per region. Peaks detected on a smoothed version of the signal. The largest peak in 24 hours is highlighted.
  • ...and 12 more figures
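
The keep-alive behavior described in the Figure 2 caption can be sketched as a small simulation. This is a simplified model, not the platform's scheduler: it assumes a single pod per function, requests that are served instantly, and a keep-alive timer that restarts after every request. The function name and the sample timestamps are hypothetical.

```python
def count_cold_starts(request_times_s, keep_alive_s):
    """Count cold starts for one function under a simplified keep-alive model.

    Assumptions: at most one pod exists at a time, requests take no time to
    serve, and the pod is deleted once it has been idle for longer than
    keep_alive_s seconds.
    """
    cold_starts = 0
    last_request = None
    for t in sorted(request_times_s):
        # No pod yet, or the previous pod has already been deleted.
        if last_request is None or t - last_request > keep_alive_s:
            cold_starts += 1
        last_request = t
    return cold_starts


# Made-up request arrival times, in seconds.
arrivals = [0, 30, 50, 200, 210]
print(count_cold_starts(arrivals, keep_alive_s=60))  # 2
print(count_cold_starts(arrivals, keep_alive_s=10))  # 4
```

With a 60-second keep-alive, only the first request and the request at 200 seconds find no warm pod. Shortening the keep-alive to 10 seconds turns two more requests into cold starts, which illustrates why the keep-alive setting affects the number of cold starts.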