Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling

Paul Joe Maliakel; Shashikant Ilager; Ivona Brandic


TL;DR

Lightweight semantic features predict inference difficulty better than input length, and reducing GPU frequency from 2842 MHz to 180 MHz yields an average of 42% energy savings with only a 1-6% latency increase.

Abstract

LLM inference exhibits substantial variability across queries and execution phases, yet inference configurations are often applied uniformly. We present a measurement-driven characterization of workload heterogeneity and energy-performance behavior of LLM inference under GPU dynamic voltage and frequency scaling (DVFS). We evaluate five decoder-only LLMs (1B-32B parameters) across four NLP benchmarks using a controlled offline setup. We show that lightweight semantic features predict inference difficulty better than input length, with 44.5% of queries achieving comparable quality across model sizes. At the hardware level, the decode phase dominates inference time (77-91%) and is largely insensitive to GPU frequency. Consequently, reducing GPU frequency from 2842 MHz to 180 MHz achieves an average of 42% energy savings with only a 1-6% latency increase. We further provide a use case with an upper-bound analysis of the potential benefits of combining workload-aware model selection with phase-aware DVFS, motivating future energy-efficient LLM inference systems.
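The reported energy and latency figures can be related with a short back-of-the-envelope calculation. The sketch below is not from the paper; it assumes only that energy equals average power times latency (E = P_avg * t), and the function name `implied_power_ratio` is a hypothetical helper introduced for illustration.

```python
def implied_power_ratio(energy_saving: float, latency_increase: float) -> float:
    """Return P_low / P_high implied by E = P_avg * t.

    energy_saving:    fractional energy reduction (0.42 for the reported 42%).
    latency_increase: fractional latency growth (0.01 to 0.06 in the abstract).
    Assumption: energy is average power multiplied by latency.
    """
    energy_ratio = 1.0 - energy_saving    # E_low / E_high
    time_ratio = 1.0 + latency_increase   # t_low / t_high
    return energy_ratio / time_ratio


if __name__ == "__main__":
    # Evaluate both ends of the reported 1-6% latency range.
    for lat in (0.01, 0.06):
        ratio = implied_power_ratio(0.42, lat)
        print(f"latency +{lat:.0%}: average power is {ratio:.1%} of baseline")
```

Under that assumption, a 42% energy saving with a 1-6% latency increase corresponds to average power falling to roughly 55-57% of its baseline value, which suggests the savings come almost entirely from lower power draw rather than from any change in run time.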


Paper Structure (40 sections, 7 figures, 18 tables)


Introduction
Background
Transformer-based LLM Inference
Prefill and Decode Phases
GPU DVFS and Energy-Performance Tradeoffs
Workload Heterogeneity in LLM Inference
Related Work
Design and Methodology
Design
Testbed and Measurement Infrastructure
Models
Datasets and Metrics
Workload Characterization
Motivation and Scope
Input Length and Structural Properties
...and 25 more sections

Figures (7)

Figure 1: Study workflow.
Figure 2: Input length vs quality score. The near-zero correlation (r = 0.002) demonstrates that length alone cannot predict query difficulty. Easy/hard labels indicate whether normalized mean quality across models exceeds 0.5.
Figure 3: Energy consumption per generated token across GPU frequencies. Lower frequencies achieve better energy efficiency (higher tokens per joule) due to the memory-bound nature of the decode phase.
Figure 4: The frequency cliff: energy savings plateau below ~1000 MHz. All models achieve 40-45% savings in the plateau region, with diminishing returns at lower frequencies. The optimal operating point lies at ~960 MHz, where energy savings are maximized without significant latency penalty.
Figure 5: Effect of batch size on DVFS effectiveness. Energy savings remain consistent (42-44%) across all batch sizes. Latency penalties decrease with larger batches as prefill overhead is amortized over more tokens.
...and 2 more figures
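The quantities in the Figure 2 caption can be sketched in a few lines. This is an illustrative reading of the caption, not the authors' code: the helpers `pearson_r` and `label_query` and the toy data are hypothetical, and the mapping "mean quality above 0.5 means easy" is an assumption about which side of the threshold is called easy.

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def label_query(per_model_quality, threshold=0.5):
    """Label a query from its normalized quality scores across models.

    Assumption: a mean above the threshold is labeled "easy", otherwise "hard".
    """
    return "easy" if mean(per_model_quality) > threshold else "hard"


if __name__ == "__main__":
    # Hypothetical toy data: input length in tokens and, for each query,
    # normalized quality scores from several models.
    lengths = [12, 85, 40, 230, 64]
    qualities = [[0.9, 0.8, 0.95], [0.2, 0.4, 0.3], [0.7, 0.6, 0.8],
                 [0.6, 0.7, 0.9], [0.1, 0.3, 0.5]]

    mean_quality = [mean(q) for q in qualities]
    print("labels:", [label_query(q) for q in qualities])
    print("length-quality correlation:", round(pearson_r(lengths, mean_quality), 3))
```

A correlation near zero, as reported in Figure 2, would mean that input length carries almost no linear information about the mean quality score.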
