The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute

Yunho Jin, Gu-Yeon Wei, David Brooks

TL;DR

The paper investigates the energy costs of test-time compute (TTC) as a complementary strategy to traditional model scaling for LLM inference. By systematically comparing two TTC approaches—parallel sampling with majority vote (MV) and reasoning tokens (RT)—to baseline scaling across Qwen2.5 models on Math, Code, and Commonsense benchmarks, it reveals that RT frequently achieves a superior accuracy-per-energy frontier, especially on reasoning-intensive tasks, while MV often incurs energy penalties with limited gains. The study also shows that output length and query difficulty strongly influence energy usage, and that system-level optimizations (e.g., prefix caching, speculative decoding) can modulate TTC payoffs. Collectively, these results point to practical, difficulty-aware, and length-aware strategies (such as length-wise early exit) for deploying sustainable, accurate LLMs without solely increasing model size.
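To make the MV approach concrete, here is a minimal sketch of majority voting over parallel samples. It assumes each sample has already been reduced to a final answer string; the function name `majority_vote` and the example answers are hypothetical and not taken from the paper. The energy implication is visible in the structure: every extra sample is a full additional generation, but only one answer is kept.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among parallel samples.

    `answers` is a list of final-answer strings, one per sampled generation.
    Ties go to the answer that appeared first (Counter.most_common keeps
    first-encountered order for equal counts).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner


# Hypothetical final answers extracted from five parallel samples.
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # prints 42
```

Answer extraction and normalization (for example, treating "42" and "42.0" as equal) are left out here and would matter in practice.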

Abstract

Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work explores how test-time compute (TTC) can serve as an energy-efficient complement to conventional scaling strategies by allocating additional computational resources at inference time rather than during training. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models.

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Accuracy versus energy per query, averaged across the four benchmarks in each task. The dots on each line represent Qwen2.5 1.5B, 7B, 14B, and 32B, from left to right.
  • Figure 2: Energy consumption of MV and RT normalized to Base, which does not use TTC. For each color, the left bar represents MV and the right bar represents RT. The first, second, and third sets of four benchmarks represent math, code, and common sense, respectively. The dotted horizontal line represents Base. Note that the y-axis is cut off from 20 to 110.
  • Figure 3: Power readings during runtime averaged across four benchmarks in each task.
  • Figure 4: Energy versus accuracy per output length. Dotted and solid lines represent Base and RT, respectively, and each color represents a different model size. Each of the ten dots on a line represents an output sequence length limit, ranging from one-tenth of the model's maximum sequence length up to the maximum sequence length. Grey lines at the top represent the best models at the time of writing.
  • Figure 5: Output token count distribution of correct and incorrect queries.
  • ...and 2 more figures
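
The figures above report power readings over time (Figure 3), energy per query (Figure 1), and energy normalized to Base (Figure 2). The sketch below shows one common way such quantities relate: integrating sampled power over time with the trapezoidal rule to obtain energy, then dividing by query count or by a baseline. This is an illustration under stated assumptions, not the paper's measurement pipeline, which this section does not describe; all function names and the sample values are hypothetical.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate power samples (watts) over time (seconds), trapezoidal rule.

    Assumes timestamps are non-decreasing and aligned one-to-one with samples.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples to integrate")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_per_query(total_joules, num_queries):
    """Average energy attributed to each query in a run."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return total_joules / num_queries


def normalized_energy(ttc_joules, base_joules):
    """Energy of a TTC run relative to a baseline run (baseline = 1.0)."""
    if base_joules <= 0:
        raise ValueError("baseline energy must be positive")
    return ttc_joules / base_joules


# Hypothetical readings: three samples taken one second apart.
run_energy = energy_joules([0.0, 1.0, 2.0], [100.0, 200.0, 100.0])
print(run_energy)                       # prints 300.0
print(energy_per_query(run_energy, 4))  # prints 75.0
```

A normalized value above 1.0 means the TTC configuration used more energy than the baseline for the same workload.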