The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
Yunho Jin, Gu-Yeon Wei, David Brooks
TL;DR
The paper investigates the energy cost of test-time compute (TTC) as a complement to traditional model scaling for LLM inference. It systematically compares two TTC approaches, parallel sampling with majority vote (MV) and reasoning tokens (RT), against baseline model scaling across Qwen2.5 models on Math, Code, and Commonsense benchmarks. RT frequently achieves a superior accuracy-per-energy frontier, especially on reasoning-intensive tasks, while MV often incurs energy penalties with limited accuracy gains. The study also shows that output length and query difficulty strongly influence energy usage, and that system-level optimizations (e.g., prefix caching, speculative decoding) can change how much TTC pays off. Together, these results point to practical difficulty-aware and length-aware strategies (such as length-wise early exit) for deploying sustainable, accurate LLMs without relying solely on larger models.
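To make the MV strategy and the accuracy-per-energy comparison concrete, here is a minimal Python sketch. It is illustrative only and not the authors' implementation: the function names, the tie-breaking rule, and the choice of joules as the energy unit are all assumptions, and the paper may define its efficiency metric differently.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent answer among parallel samples.

    Assumption: ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts in Python 3.7+.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def accuracy_per_energy(predictions, references, energy_joules):
    """Accuracy divided by total energy (hypothetical metric, in 1/joule)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    correct = sum(p == r for p, r in zip(predictions, references))
    return (correct / len(references)) / energy_joules


# Three sampled answers for one query; the voted answer is "42".
voted = majority_vote(["42", "41", "42"])
```

Under this simplified view, MV multiplies decoding energy by roughly the number of samples, so it improves accuracy per energy only when the vote raises accuracy by more than that factor.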
Abstract
Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work explores how test-time compute (TTC) can serve as an energy-efficient complement to conventional scaling strategies by allocating additional computational resources at inference time rather than during training. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models.
