Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference

Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, Josep Torrellas

TL;DR

LLM inference imposes rising energy demands in data centers. The authors characterize three energy-efficiency knobs (workload type, batching, and tensor/pipeline parallelism) and assess how each interacts with GPU frequency scaling. Experiments use Llama-2 70B on a DGX-H100 with vLLM and measure time to first token (TTFT), time between tokens (TBT), throughput, power, and energy. The main contribution is a detailed map of performance-energy trade-offs across workload types and parallelism configurations. The findings are that energy savings of around 20% are achievable in many cases without hurting latency or throughput, and that workload-aware choices can yield further gains. This work informs energy-aware orchestration and scheduling for greener, scalable LLM deployment in data centers.
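To make the measured quantities concrete, the sketch below shows one common way to derive TTFT, mean TBT, and energy from raw measurements. It is a minimal illustration, not the authors' measurement harness: the `RequestTrace` type, the function names, and the fixed-interval power sampling are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request trace; not a vLLM data structure."""
    arrival_s: float            # time the request arrived
    token_times_s: List[float]  # completion time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token completion minus arrival."""
    return trace.token_times_s[0] - trace.arrival_s


def mean_tbt(trace: RequestTrace) -> float:
    """Mean time between tokens over the decode phase."""
    times = trace.token_times_s
    if len(times) < 2:
        return 0.0  # a single token has no inter-token gap
    return (times[-1] - times[0]) / (len(times) - 1)


def energy_joules(power_samples_w: List[float], interval_s: float) -> float:
    """Energy as a rectangle-rule sum of fixed-interval power samples."""
    return sum(power_samples_w) * interval_s
```

Energy is the product of power and time, so a lower GPU frequency saves energy only if the drop in power outweighs the longer runtime.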

Abstract

With the ubiquitous use of modern large language models (LLMs) across industries, the inference serving for these models is ever expanding. Given the high compute and memory requirements of modern LLMs, more and more top-of-the-line GPUs are being deployed to serve these models. Energy availability has come to the forefront as the biggest challenge for data center expansion to serve these models. In this paper, we present the trade-offs brought up by making energy efficiency the primary goal of LLM serving under performance SLOs. We show that depending on the inputs, the model, and the service-level agreements, there are several knobs available to the LLM inference provider to use for being energy efficient. We characterize the impact of these knobs on the latency, throughput, as well as the energy. By exploring these trade-offs, we offer valuable insights into optimizing energy usage without compromising on performance, thereby paving the way for sustainable and cost-effective LLM deployment in data center environments.
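The idea of treating energy as the primary goal under performance SLOs can be sketched as a small selection problem: among profiled configurations, keep those that meet the latency targets and pick the one with the lowest energy. The snippet below is an illustrative assumption, not the paper's method; the `Profile` type, the `pick_config` function, and every number in the usage example are made up for demonstration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical profiled operating point for one GPU frequency."""
    freq_mhz: int
    ttft_s: float
    tbt_s: float
    energy_j_per_req: float


def pick_config(profiles: List[Profile],
                ttft_slo_s: float,
                tbt_slo_s: float) -> Optional[Profile]:
    """Return the lowest-energy profile that meets both SLOs, or None."""
    feasible = [p for p in profiles
                if p.ttft_s <= ttft_slo_s and p.tbt_s <= tbt_slo_s]
    return min(feasible, key=lambda p: p.energy_j_per_req, default=None)


# Made-up numbers for illustration only; they are not results from the paper.
example_profiles = [
    Profile(freq_mhz=1980, ttft_s=0.40, tbt_s=0.030, energy_j_per_req=1000.0),
    Profile(freq_mhz=1400, ttft_s=0.50, tbt_s=0.032, energy_j_per_req=800.0),
    Profile(freq_mhz=800, ttft_s=0.90, tbt_s=0.045, energy_j_per_req=850.0),
]
best = pick_config(example_profiles, ttft_slo_s=0.6, tbt_slo_s=0.040)
```

With these hypothetical values, the 800 MHz point violates both SLOs, so the 1400 MHz point is selected. The same filter-then-minimize pattern extends to other knobs, such as batch size or parallelism degree, by adding fields to the profile.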

Paper Structure (13 sections, 15 figures)

Figures (15)

  • Figure 1: Normalized TTFT at varying GPU frequencies for different inputs/outputs.
  • Figure 2: Normalized TBT at varying GPU frequencies for different inputs/outputs.
  • Figure 3: Maximum throughput of an 8-way tensor-parallel GPU Llama2 instance with different GPU frequencies for different input/output types.
  • Figure 4: Normalized power consumption of an 8-way tensor-parallel GPU Llama2 instance with different GPU frequencies for different request types.
  • Figure 5: Normalized energy consumption of an 8-way tensor-parallel GPU Llama2 instance with different GPU frequencies for different request types.
  • ...and 10 more figures