TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, Ricardo Bianchini

TL;DR

LLM inference in cloud datacenters faces tight thermal and power constraints because execution phases shift at millisecond scale and SaaS and IaaS workloads are co-located. TAPAS combines three mechanisms (VM placement, request routing, and instance configuration) and uses historical temperature and power data to maximize cooling and power oversubscription while preserving SaaS performance. In the reported results, P99 latency is maintained, maximum temperatures drop by up to 17% and peak row power by 23%, oversubscription capacity increases by up to 40%, and throttling events fall sharply (up to 97% of thermal and 99% of power throttling avoided); the system also remains resilient to failures. Together, these results indicate that TAPAS can materially reduce cloud datacenter TCO and improve the robustness of LLM serving under varying load and emergency conditions.

Abstract

The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques are often inadequate for LLM inference because of its fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs between performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). The system leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.
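
To make step (1) of the abstract concrete, the following is a minimal sketch of placing a VM under cooling and power constraints. It is not the TAPAS placement policy, which the abstract does not specify. The `Server` class, the `place_vm` function, the per-row power budget, the inlet-temperature limit, and the greedy "coolest feasible server" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Server:
    """Hypothetical server record; fields are illustrative assumptions."""
    name: str
    row: str
    predicted_inlet_c: float  # assumed forecast derived from historical data
    free: bool = True


def place_vm(
    servers: List[Server],
    row_power_w: Dict[str, float],   # current peak power draw per row
    row_budget_w: Dict[str, float],  # provisioned power budget per row
    vm_peak_w: float,                # expected peak power of the new VM
    inlet_limit_c: float,            # maximum acceptable inlet temperature
) -> Optional[Server]:
    """Greedy placement: pick the coolest free server whose row has headroom."""
    candidates = [
        s for s in servers
        if s.free
        and s.predicted_inlet_c <= inlet_limit_c
        and row_power_w[s.row] + vm_peak_w <= row_budget_w[s.row]
    ]
    if not candidates:
        return None  # no server satisfies both constraints

    # Prefer the lowest predicted inlet temperature; break ties by the
    # largest remaining power headroom in the server's row.
    best = min(
        candidates,
        key=lambda s: (
            s.predicted_inlet_c,
            -(row_budget_w[s.row] - row_power_w[s.row]),
        ),
    )
    best.free = False
    row_power_w[best.row] += vm_peak_w
    return best
```

A caller would build the server list and the per-row power maps from telemetry, then call `place_vm` once per incoming VM; a return value of `None` signals that the VM cannot be placed without violating the assumed thermal or power limit.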

Paper Structure (21 sections, 4 equations, 21 figures, 2 tables)

Figures (21)

  • Figure 1: Sample datacenter layout illustrating 80 racks organized into 8 rows with 4 cold aisles. The rack color represents the inlet temperatures for the top server.
  • Figure 2: Inlet and outside temperatures for three servers throughout August 2024.
  • Figure 3: Regression analysis comparing inlet and outside temperatures for three sample servers. Includes actual measurements for Server 3.
  • Figure 4: Inlet temperature distribution across physical entities: rows, racks within rows, and height within racks.
  • Figure 5: Inlet temperature as a function of datacenter load and outside temperature. It includes actual measurements and regression lines per power load levels.
  • ...and 16 more figures