SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, Ion Stoica

TL;DR

This work tackles the high cost and reliability challenges of serving AI inference on spot GPUs by proposing SpotHedge, a policy that dynamically places spot replicas across multiple regions and clouds and maintains a flexible mix with on-demand replicas. The SkyServe system implements SpotHedge, providing provisioning, autoscaling, and load balancing to sustain availability and QoS while leveraging spot cost savings. In end-to-end cloud experiments and trace-based simulations, the approach reduces cost by approximately $43\%$ on average compared to on-demand replicas and improves P50 latency by $2.3\times$ on average compared to other research and production systems, with consistently low failure rates. This enables economical, resilient multi-region AI serving, and the open-source platform is intended to foster further research in spot-based inference serving across clouds.

Abstract

Recent years have witnessed an explosive growth of AI models. The high cost of hosting AI services on GPUs and their demanding service requirements make it timely and challenging to lower service costs and guarantee service quality. While spot instances have long been offered at a large discount, spot preemptions have discouraged users from using them to host model replicas when serving AI models. To address this, we propose a simple yet efficient policy, SpotHedge, that leverages spot replicas across different failure domains (e.g., regions and clouds) to ensure availability, lower costs, and high service quality. SpotHedge intelligently spreads spot replicas across different regions and clouds to improve availability and reduce correlated preemptions, overprovisions more cheap spot replicas than required as a safeguard against possible preemptions, and dynamically falls back to on-demand replicas when spot replicas become unavailable. We built SkyServe, a system leveraging SpotHedge to efficiently serve AI models over a mixture of spot and on-demand replicas across regions and clouds. We compared SkyServe with both research and production systems on real AI workloads: SkyServe reduces cost by 43% on average while achieving high resource availability compared to using on-demand replicas. Additionally, SkyServe improves P50, P90, and P99 latency by 2.3$\times$, 2.1$\times$, and 2.1$\times$ on average compared to other research and production systems.
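
The three ingredients named in the abstract (spreading spot replicas across failure domains, overprovisioning spot replicas, and falling back to on-demand replicas) can be illustrated with a minimal sketch. This is not the SpotHedge algorithm from the paper. The function `plan_replicas`, the `Plan` type, the `extra_spot` parameter, and the per-zone capacity map are all hypothetical names chosen for illustration, and the fallback rule (cover any shortfall against the target with on-demand replicas) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    spot: dict[str, int]  # zone -> number of spot replicas to request
    on_demand: int        # on-demand replicas used as fallback


def plan_replicas(target: int, extra_spot: int, zone_capacity: dict[str, int]) -> Plan:
    """Illustrative placement only; all names here are hypothetical.

    target        -- replicas needed to serve the current load
    extra_spot    -- additional spot replicas kept as a preemption buffer
    zone_capacity -- assumed number of spot replicas obtainable per zone
    """
    wanted_spot = target + extra_spot
    remaining = dict(zone_capacity)
    spot = {zone: 0 for zone in zone_capacity}
    placed = 0

    # Round-robin over zones so replicas are spread across failure domains
    # instead of being packed into a single zone.
    while placed < wanted_spot and any(remaining.values()):
        for zone in zone_capacity:
            if placed == wanted_spot:
                break
            if remaining[zone] > 0:
                spot[zone] += 1
                remaining[zone] -= 1
                placed += 1

    # Assumed fallback rule: if spot capacity cannot cover the target,
    # fill the shortfall with on-demand replicas.
    on_demand = max(0, target - placed)
    return Plan(spot={z: n for z, n in spot.items() if n > 0}, on_demand=on_demand)


if __name__ == "__main__":
    zones = {"us-east-1a": 3, "us-west-2a": 3, "eu-west-1a": 3}
    print(plan_replicas(target=4, extra_spot=2, zone_capacity=zones))
```

With ample spot capacity the plan uses only spot replicas, spread evenly across zones. When capacity is scarce, the same call returns a non-zero `on_demand` count, which is the mixed spot and on-demand deployment the abstract describes.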

Paper Structure

This paper contains 67 sections, 1 equation, 15 figures, 1 table, 1 algorithm.

Figures (15)

  • Figure 2: An AI service comprises multiple model replicas; each replica is hosted on one or multiple instances. Each replica can independently serve user requests without communicating with other replicas.
  • Figure 3: (a) Correlated spot GPU preemptions within the same region; (b) Lack of correlation across regions. Both (a) and (b) are from a 2-week trace (§\ref{microbenchmark}) for 4 p3.2xlarge (V100) instances. To collect the trace, we try to maintain the desired number of spot instances, record preemptions, and replenish any preempted instances. Each vertical line indicates either a preemption (from higher to lower) or a successful launch (from lower to higher). (c) Correlated preemptions across 8 zones in 3 regions on AWS for V100 GPUs. Each cell shows the correlation between the two zones indicated by the row and column labels. The values are Pearson correlation coefficients (with $p < 0.01$), and correlations $\geq 0.3$ are shown in bold. Intra-region correlation is higher among {east-1a, east-1c, east-1d, east-1f}, {east-2a, east-2b}, and {west-2a, west-2b}, whereas there is little to no inter-region correlation.
  • Figure 4: Spot GPUs (p3.2xlarge) experience more preemptions than spot CPUs (c3-highcpu-176). Horizontal lines represent the available period. Vertical bars are changes from available to unavailable, followed by grey gaps indicating the unavailable period.
  • Figure 5: Service availability improves as the number of zones and regions considered increases. Fig. \ref{fig:avail-vs-zone-a100} uses a 3-day trace for a2-ultragpu-4g in 6 zones and 5 regions. Fig. \ref{fig:avail-vs-zone-v100} uses a 2-month trace for p3.2xlarge in 9 zones and 3 regions.
  • Figure 6: Latency Characteristics of AI Services. Fig. \ref{fig:request-latency-breakdown} measures the latency breakdown of a Vicuna-13B endpoint, serving a request with 20 input and 44 output tokens. Fig. \ref{fig:network-latency} measures round-trip network latency between different regions of GCP.
  • ...and 10 more figures