The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective

Jiin Kim, Byeongjun Shin, Jinha Chung, Minsoo Rhu

TL;DR

This work provides the first system-level analysis of AI agents powered by LLMs, quantifying how dynamic, tool-augmented reasoning drives substantial compute, memory, and energy costs beyond static inference. It reveals that while deeper or broader reasoning improves accuracy, returns quickly diminish and latency becomes highly variable, threatening practical deployment at scale. Through empirical evaluation of representative agents and benchmarks, the study highlights bottlenecks in sequential LLM-tool interactions, memory growth from history tokens, and underutilization of GPUs, and demonstrates the potential of prefix caching and inter-request parallelism to mitigate some costs. The findings advocate for compute-efficient agent designs, smarter resource management, and architecture-level co-design to achieve sustainable, scalable AI agents in real-world infrastructure.

Abstract

Large-language-model (LLM)-based AI agents have recently showcased impressive versatility by employing dynamic reasoning, an adaptive, multi-step process that coordinates with external tools. This shift from static, single-turn inference to agentic, multi-turn workflows broadens task generalization and behavioral flexibility, but it also introduces serious concerns about system-level cost, efficiency, and sustainability. This paper presents the first comprehensive system-level analysis of AI agents, quantifying their resource usage, latency behavior, energy consumption, and datacenter-wide power consumption demands across diverse agent designs and test-time scaling strategies. We further characterize how AI agent design choices, such as few-shot prompting, reflection depth, and parallel reasoning, impact accuracy-cost tradeoffs. Our findings reveal that while agents improve accuracy with increased compute, they suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs. Through detailed evaluation of representative agents, we highlight the profound computational demands introduced by AI agent workflows, uncovering a looming sustainability crisis. These results call for a paradigm shift in agent design toward compute-efficient reasoning, balancing performance with deployability under real-world constraints.
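The memory growth from history tokens and the benefit of prefix caching noted in the TL;DR can be illustrated with a small back-of-the-envelope model. This is a minimal sketch, not the paper's methodology: the function `prefill_tokens`, its parameters, and all token counts are hypothetical, and it assumes each LLM call re-reads the full history, and that a prefix cache retains the key-value entries for every token seen or generated in earlier calls.

```python
def prefill_tokens(prompt_tokens, output_tokens, tool_tokens, steps, prefix_cache):
    """Count tokens the LLM must prefill over `steps` sequential LLM calls.

    Hypothetical model: after each call, the LLM output and the tool
    observation are appended to the history that the next call reads.
    """
    context = prompt_tokens  # tokens in the context at the next LLM call
    cached = 0               # tokens assumed to already sit in the prefix cache
    total = 0
    for _ in range(steps):
        # Without caching, the whole context is prefilled on every call.
        total += (context - cached) if prefix_cache else context
        # Assumption: prompt and generated tokens stay cached after the call.
        cached = context + output_tokens
        # The tool observation is new text appended to the history.
        context += output_tokens + tool_tokens
    return total


# Illustrative numbers only: 1000-token prompt, 100 output tokens and
# 200 tool-observation tokens per step, 5 sequential LLM calls.
without_cache = prefill_tokens(1000, 100, 200, 5, prefix_cache=False)
with_cache = prefill_tokens(1000, 100, 200, 5, prefix_cache=True)
print(without_cache, with_cache)  # 8000 1800
```

Under these assumptions, prefill work without caching grows quadratically with the number of steps, while with prefix caching only the initial prompt and each new tool observation are prefilled. The model ignores decode cost, tool latency, and cache eviction.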

Paper Structure

This paper contains 17 sections, 23 figures, and 3 tables.

Figures (23)

  • Figure 1: Overview of test‑time scaling. (a) Conventional LLMs map inputs directly to outputs in a single forward pass, with no explicit intermediate reasoning. (b) Reasoning‑enhanced LLMs internally create intermediate steps—sampling alternative responses or extending token sequences—to deepen or diversify their thought process. (c) AI agents augment this reasoning by planning and invoking external tools, observing the outcomes and adapting their internal reasoning accordingly, and iteratively refining their decision-making until they generate the final answer.
  • Figure 2: Overview of AI agent structure.
  • Figure 3: Execution timeline of each AI agent.
  • Figure 4: Average number of LLM and tool invocations per request.
  • Figure 5: Latency breakdown of agents (left axis, bar graph) and their end-to-end latency for processing a single request (right axis, diamond marker). The pink bars represent phases where LLM and tool execution latencies overlap, as observed in LLMCompiler, which asynchronously executes tools during plan generation.
  • ...and 18 more figures