Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads

Grant Wilkins, Srinivasan Keshav, Richard Mortier

TL;DR

The paper addresses the high energy cost of LLM inference in data centers by introducing a cost-based scheduling framework for hybrid, heterogeneous hardware. A token-based policy routes queries with few tokens to energy-efficient processors (e.g., M1 Pro) and queries with many tokens to high-performance GPUs (e.g., A100), giving a real-time, data-driven way to minimize energy while balancing runtime. The headline result is a reported 7.5% reduction in CPU+GPU energy consumption relative to a workload-unaware baseline, obtained with separate thresholds on input and output token counts. The authors also publish a dataset and benchmark suite for evaluating the energy efficiency of LLM inference, to support further work on sustainable AI infrastructure.

Abstract

Both the training and use of Large Language Models (LLMs) require large amounts of energy. Their increasing popularity therefore raises critical concerns regarding the energy efficiency and sustainability of the data centers that host them. This paper addresses the challenge of reducing energy consumption in data centers running LLMs. We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate LLM tasks across hardware accelerators that differ in their energy efficiencies and computational capabilities. Specifically, our workload-aware strategy determines whether tasks are processed on energy-efficient processors or high-performance GPUs based on the number of input and output tokens in a query. Our analysis of a representative LLM dataset finds that this hybrid strategy can reduce CPU+GPU energy consumption by 7.5% compared to a workload-unaware baseline.
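
The token-based routing described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the threshold values, the `Query` type, the pool labels, and the choice to combine the input and output thresholds with a logical AND are assumptions made for illustration, not the authors' implementation or their measured cutoffs.

```python
from dataclasses import dataclass

# Hypothetical cutoffs for illustration only; the paper derives its
# thresholds from measured energy and runtime data.
INPUT_TOKEN_THRESHOLD = 32
OUTPUT_TOKEN_THRESHOLD = 32


@dataclass(frozen=True)
class Query:
    input_tokens: int
    output_tokens: int


def route(query: Query,
          input_threshold: int = INPUT_TOKEN_THRESHOLD,
          output_threshold: int = OUTPUT_TOKEN_THRESHOLD) -> str:
    """Return "efficient" for small queries and "gpu" otherwise.

    Assumption: a query counts as small only if both its input and
    output token counts are at or below their thresholds.
    """
    small = (query.input_tokens <= input_threshold
             and query.output_tokens <= output_threshold)
    return "efficient" if small else "gpu"


def partition(queries):
    """Group queries by the hardware pool that route() assigns them to."""
    pools = {"efficient": [], "gpu": []}
    for q in queries:
        pools[route(q)].append(q)
    return pools
```

In a real deployment the output token count is not known before generation, so a scheduler of this kind would need an estimate or a cap (such as a maximum generation length) in place of `output_tokens`; the sketch sets that question aside.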

Paper Structure (33 sections, 8 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Performance of Various Systems and Models for Processing Variable Input Tokens--Due to the low variance in the data, error bars are too small to be visible.
  • Figure 2: Performance of Various Systems and Models for Processing Variable Output Tokens--Missing data points in M1-Pro and Palmetto Intel+V100 are due to CUDA out of memory errors. Due to the low variance in the data, error bars are too small to be visible.
  • Figure 3: Distribution of Token Counts for Alpaca
  • Figure 4: Performance of Hybrid Datacenter for Input Tokens Processing Alpaca--Dashed line shows the value for using only one kind of hardware for inference
  • Figure 5: Performance of Hybrid Datacenter for Output Tokens Processing Alpaca--Dashed line shows the value for using only one kind of hardware for inference