Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling

Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic

TL;DR

Lightweight semantic features predict inference difficulty better than input length, and reducing GPU frequency from 2842 MHz to 180 MHz yields an average of 42% energy savings with only a 1-6% latency increase.

Abstract

LLM inference exhibits substantial variability across queries and execution phases, yet inference configurations are often applied uniformly. We present a measurement-driven characterization of workload heterogeneity and energy-performance behavior of LLM inference under GPU dynamic voltage and frequency scaling (DVFS). We evaluate five decoder-only LLMs (1B-32B parameters) across four NLP benchmarks using a controlled offline setup. We show that lightweight semantic features predict inference difficulty better than input length, with 44.5% of queries achieving comparable quality across model sizes. At the hardware level, the decode phase dominates inference time (77-91%) and is largely insensitive to GPU frequency. Consequently, reducing GPU frequency from 2842 MHz to 180 MHz achieves an average of 42% energy savings with only a 1-6% latency increase. We further provide a use case with an upper-bound analysis of the potential benefits of combining workload-aware model selection with phase-aware DVFS, motivating future energy-efficient LLM inference systems.
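
The headline numbers can be sanity-checked with the relation energy = average power × time. The sketch below is illustrative arithmetic only, not the authors' methodology: it treats the reported 42% average saving and the 1-6% latency range as if they applied to a single run, and the function name `implied_power_ratio` is hypothetical.

```python
def implied_power_ratio(energy_savings: float, latency_increase: float) -> float:
    """Ratio of average power after vs. before frequency scaling.

    Uses E = P * t, so P_after / P_before = (E_after / E_before) / (t_after / t_before).
    Both arguments are fractions (0.42 means 42%).
    """
    energy_ratio = 1.0 - energy_savings   # E_after / E_before
    time_ratio = 1.0 + latency_increase   # t_after / t_before
    return energy_ratio / time_ratio


# Assumption: combine the average saving with each end of the latency range.
low = implied_power_ratio(0.42, 0.06)   # largest reported slowdown
high = implied_power_ratio(0.42, 0.01)  # smallest reported slowdown
print(f"average power falls to roughly {low:.2f}-{high:.2f} of its baseline value")
```

Under these assumptions, almost all of the energy saving comes from lower average power draw rather than from any change in runtime, which is consistent with a decode phase that is largely insensitive to GPU frequency.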

Paper Structure (40 sections, 7 figures, 18 tables)

Figures (7)

  • Figure 1: Study workflow.
  • Figure 2: Input length vs quality score. The near-zero correlation ($r = 0.002$) demonstrates that length alone cannot predict query difficulty. Easy/hard labels indicate whether normalized mean quality across models exceeds 0.5.
  • Figure 3: Energy consumption per generated token across GPU frequencies. Lower frequencies achieve better energy efficiency (higher tokens per joule) due to the memory-bound nature of the decode phase.
  • Figure 4: The frequency cliff: energy savings plateau below ~1000 MHz. All models achieve 40-45% savings in the plateau region, with diminishing returns at lower frequencies. The optimal operating point lies at ~960 MHz where energy savings are maximized without significant latency penalty.
  • Figure 5: Effect of batch size on DVFS effectiveness. Energy savings remain consistent (42-44%) across all batch sizes. Latency penalties decrease with larger batches as prefill overhead is amortized over more tokens.
  • ...and 2 more figures
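
The Figure 2 caption describes two computations: a Pearson correlation between input length and quality, and an easy/hard label based on whether the normalized mean quality across models exceeds 0.5. A minimal sketch of both follows. It assumes that "exceeds 0.5" maps to "easy"; the function names and sample values are hypothetical and are not data from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def label_query(per_model_quality, threshold=0.5):
    """Label a query from its normalized quality scores, one per model.

    Assumption: a mean strictly above the threshold means "easy".
    """
    mean_quality = sum(per_model_quality) / len(per_model_quality)
    return "easy" if mean_quality > threshold else "hard"


# Made-up illustration data: input lengths in tokens and per-model quality scores.
lengths = [12, 40, 85, 150, 300]
quality = [[0.9, 0.8, 0.7], [0.2, 0.4, 0.3], [0.6, 0.9, 0.9],
           [0.1, 0.3, 0.2], [0.8, 0.7, 0.9]]

mean_quality = [sum(q) / len(q) for q in quality]
print("r =", round(pearson_r(lengths, mean_quality), 3))
print([label_query(q) for q in quality])
```

A correlation near zero, as in the caption's reported r = 0.002, would mean that input length carries essentially no linear information about quality.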