Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads
Grant Wilkins, Srinivasan Keshav, Richard Mortier
TL;DR
The paper addresses the high energy cost of LLM inference in data centers by introducing a cost-based scheduling framework for hybrid, heterogeneous hardware. A token-based policy assigns queries with few tokens to energy-efficient processors (e.g., M1 Pro) and queries with many tokens to high-performance GPUs (e.g., A100), giving a data-driven way to reduce energy while keeping runtime in balance. The key result is a reported 7.5% reduction in CPU+GPU energy consumption compared with a workload-unaware baseline, achieved with separate thresholds on the number of input and output tokens. The authors also publish a dataset and benchmark suite for evaluating the energy efficiency of LLM inference, which supports further work on sustainable AI infrastructure.
Abstract
Both the training and use of Large Language Models (LLMs) require large amounts of energy. Their increasing popularity therefore raises critical concerns regarding the energy efficiency and sustainability of the data centers that host them. This paper addresses the challenge of reducing energy consumption in data centers running LLMs. We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate LLM tasks across hardware accelerators that differ in their energy efficiencies and computational capabilities. Specifically, our workload-aware strategy determines whether tasks are processed on energy-efficient processors or high-performance GPUs based on the number of input and output tokens in a query. Our analysis of a representative LLM dataset finds that this hybrid strategy can reduce CPU+GPU energy consumption by 7.5% compared to a workload-unaware baseline.
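
The token-based routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold values, the `Query` type, the `route` function, and the pool names are all hypothetical, and the sketch assumes the number of output tokens is known or estimated before the query is scheduled.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration; not taken from the paper.
INPUT_TOKEN_THRESHOLD = 32
OUTPUT_TOKEN_THRESHOLD = 32


@dataclass(frozen=True)
class Query:
    input_tokens: int
    output_tokens: int  # assumed known or estimated ahead of scheduling


def route(query, t_in=INPUT_TOKEN_THRESHOLD, t_out=OUTPUT_TOKEN_THRESHOLD):
    """Pick a hardware pool for a query using a simple two-threshold rule.

    Queries that are small in both input and output tokens go to the
    energy-efficient pool; everything else goes to the GPU pool.
    """
    if query.input_tokens <= t_in and query.output_tokens <= t_out:
        return "efficient"
    return "gpu"


queries = [Query(8, 16), Query(512, 16), Query(8, 256)]
assignments = [route(q) for q in queries]
print(assignments)
```

In this sketch a query must fall under both thresholds to be sent to the energy-efficient pool. A cost-based scheduler of the kind the abstract describes would presumably derive such thresholds from measured energy and runtime per token on each system rather than fix them by hand.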
