DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

Jovan Stojkovic; Chaojie Zhang; Íñigo Goiri; Josep Torrellas; Esha Choukse

TL;DR

The paper tackles the high energy and carbon footprint of LLM inference clusters by introducing DynamoLLM, an automatic framework that dynamically reconfigures cluster organization to optimize energy and cost while satisfying latency SLOs. It leverages heterogeneous energy-performance profiles, per-request-type pools, and predictive scheduling within a hierarchical control architecture to adapt to dynamic workloads and mitigate reconfiguration overheads. Using profiling-based energy models and MILP optimization (with practical approximations), DynamoLLM is evaluated on production traces across large GPU clusters and, at the service level, reports 53% lower energy, 38% lower operational carbon emissions, and 61% lower customer cost while meeting latency SLOs. The work advances practical, scalable energy-aware LLM serving by integrating profiling, prediction, and overhead-sensitive reconfiguration into a cohesive system.

Abstract

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume large amounts of energy and, consequently, to produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and the fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize the energy and cost of LLM serving under the service's performance SLOs. We show that, at the service level, DynamoLLM saves 53% of energy, cuts operational carbon emissions by 38%, and reduces cost to the customer by 61%, while meeting the latency SLOs.
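
To make the configuration search space concrete, the following is a minimal sketch of choosing the lowest-power configuration that still meets a latency SLO. It is not the paper's method: it uses exhaustive search rather than the MILP formulation, and `Config`, `toy_profile`, `pick_config`, and every constant in the toy model are hypothetical placeholders, not measured data. Only the three knobs (number of instances, model parallelism, GPU frequency) and the 1980MHz maximum frequency come from the text.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    instances: int        # number of model instances
    tensor_parallel: int  # GPUs per instance (model parallelism degree)
    freq_mhz: int         # GPU core frequency


def toy_profile(cfg: Config, load_rps: float) -> Tuple[float, float]:
    """Hypothetical (latency_s, power_w) model; all constants are made up."""
    rel_freq = cfg.freq_mhz / 1980
    speed = cfg.tensor_parallel ** 0.8 * rel_freq   # sub-linear scaling with TP
    capacity = cfg.instances * speed * 4.0          # requests/s before saturation
    if load_rps >= capacity:
        return float("inf"), float("inf")           # overloaded: SLO cannot be met
    latency = (1.0 / speed) / (1.0 - load_rps / capacity)  # queueing-style blow-up
    gpus = cfg.instances * cfg.tensor_parallel
    power = gpus * (100.0 + 300.0 * rel_freq ** 3)  # idle + frequency-dependent part
    return latency, power


def pick_config(
    load_rps: float,
    slo_s: float,
    profile: Callable[[Config, float], Tuple[float, float]] = toy_profile,
) -> Optional[Tuple[Config, float]]:
    """Return (config, power_w) with minimum power among SLO-compliant configs."""
    best: Optional[Tuple[Config, float]] = None
    for n, tp, f in product((1, 2, 4), (2, 4, 8), (1200, 1600, 1980)):
        cfg = Config(n, tp, f)
        latency, power = profile(cfg, load_rps)
        if latency <= slo_s and (best is None or power < best[1]):
            best = (cfg, power)
    return best


if __name__ == "__main__":
    for load in (2.0, 10.0, 40.0):
        print(load, pick_config(load, slo_s=1.0))
```

In this toy model, raising the load shrinks the set of SLO-compliant configurations, so the cheapest feasible power can only stay the same or rise. That is the kind of load-dependent trade-off the abstract says makes dynamic reconfiguration worthwhile.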

Paper Structure (21 sections, 1 equation, 16 figures, 5 tables)

Introduction
Background
Opportunities for Energy Efficiency
Heterogeneous Energy-Performance Profiles
Dynamic LLM Inference Workloads
Reconfiguration Overheads
DynamoLLM: An Energy Management Framework for LLM Inference Clusters
Configuring Instances for Energy-Efficiency
Hierarchical Control for Dynamic Load
Reduced Overheads for Smooth Reconfiguration
Predictive Scheduling for Request Heterogeneity
DynamoLLM Implementation
Evaluation
Evaluation Setup
Cluster-Level Experiments
...and 6 more sections

Figures (16)

Figure 1: Distribution of requests based on input and output lengths categorized into three groups: short, medium, and long.
Figure 2: Load over a week for Coding and Conversation LLM inference workloads.
Figure 3: Throughput for different request types with constant frequency (1980MHz) and with re-setting the frequency (to 1980MHz) on every iteration in the background.
Figure 4: DynamoLLM architecture: a hierarchy of controllers with cluster resources split into per request-type pools.
Figure 5: Example of re-sharding a TP4 model to TP2/TP8.
...and 11 more figures
