
ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System

Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, Simran Arora

TL;DR

ThunderAgent tackles throughput and memory inefficiencies in large-scale agentic inference by introducing a program-centric abstraction that unifies KV-cache management and tool environments. The core approach combines a program-aware scheduler with a global waiting queue and lifecycle-aware tool resource management, underpinned by a Space-Time Product (STP) cost model where $\text{Cost}_{\text{total}} \approx \text{Cost}_{\text{decode}} + \text{Cost}_{\text{prefill}} + \text{Cost}_{\text{recompute}} + \text{Cost}_{\text{unused}} + \text{Cost}_{\text{caching}}$ and $\text{Cost}_{\text{recompute}} \propto c_i^2$. ThunderAgent detects and mitigates KV-cache thrashing via periodic memory checks and a Shortest-First Eviction strategy, while a global program-aware waiting queue addresses cross-node memory imbalance; asynchronous environment preparation and hook-based garbage collection reduce tool-resource overhead and leakage. Empirical results show 1.5–3.6x throughput gains in serving, 1.8–3.9x in RL rollout, and up to 4.2x disk-memory savings across diverse agent workloads, validating the practicality and scalability of end-to-end program-aware optimization. The open-source release at https://github.com/Agentic-Kinetics/ThunderAgent enables reproducibility and further development of program-centric agent inference systems.
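
The Shortest-First Eviction idea can be illustrated with a minimal sketch. It assumes, as the name suggests, that when the KV cache is over budget the programs with the shortest contexts are evicted first, since their recomputation cost ($\propto c_i^2$) is smallest. The `Program` class, `shortest_first_evict` function, and token-based capacity are hypothetical names chosen for illustration and are not taken from the ThunderAgent codebase.

```python
from dataclasses import dataclass


@dataclass
class Program:
    pid: int
    context_len: int  # c_i: tokens this program holds in the KV cache


def recompute_cost(p: Program) -> int:
    # Assumed proxy for the recomputation cost: proportional to c_i squared.
    return p.context_len ** 2


def shortest_first_evict(programs, capacity_tokens):
    """Greedy sketch: evict the shortest contexts until the rest fit.

    Returns (kept, evicted). This is an illustration only, not the
    paper's algorithm or its optimality argument.
    """
    kept = sorted(programs, key=lambda p: p.context_len)
    evicted = []
    used = sum(p.context_len for p in kept)
    while kept and used > capacity_tokens:
        victim = kept.pop(0)  # shortest remaining context
        evicted.append(victim)
        used -= victim.context_len
    return kept, evicted


programs = [Program(1, 100), Program(2, 400), Program(3, 250), Program(4, 50)]
kept, evicted = shortest_first_evict(programs, capacity_tokens=600)
total_recompute = sum(recompute_cost(p) for p in evicted)
```

In this toy input, evicting the three shortest programs costs less under the quadratic proxy than evicting the single 400-token program, which is the intuition behind preferring short contexts as eviction victims.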

Abstract

Large language models (LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of the KV cache and tool execution environments. To address these challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the full ThunderAgent system implementation at: https://github.com/Agentic-Kinetics/ThunderAgent.

Paper Structure (56 sections, 2 theorems, 20 equations, 10 figures, 5 tables)

This paper contains 56 sections, 2 theorems, 20 equations, 10 figures, 5 tables.

Key Result

Lemma 4.1

Given a program $P_i$ with context length $c_i$, the recomputation cost incurred by re-prefilling its KV cache scales quadratically with $c_i$, i.e., $\text{Cost}_{\text{recompute}} \propto c_i^2$.
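
One hedged way to see where a quadratic dependence can come from under a space-time product view is sketched here. It assumes a constant per-token KV footprint $m$ and a constant re-prefill rate of $r$ tokens per unit time; both symbols are illustrative assumptions, and this is not the paper's proof. After time $\tau$ the rebuilt cache holds $r\tau$ tokens, and re-prefill finishes at $\tau = c_i / r$, so integrating occupied memory over time gives

$$
\text{Cost}_{\text{recompute}} \approx \int_0^{c_i / r} m \, r \, \tau \; d\tau = \frac{m \, c_i^2}{2r} \propto c_i^2 .
$$

Under these assumptions, doubling the context length quadruples the recomputation cost, so re-prefilling one long program costs more than re-prefilling several short ones with the same total length.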

Figures (10)

  • Figure 1: Performance comparison of ThunderAgent against prior agent inference systems as the number of parallel workflows (i.e., batch size) increases. We evaluate the GLM-4.6 MoE model serving SWE-Agent on SWE-Bench Lite (Figures a and b) and SWE-Agent, OpenHands, and ToolOrchestra (Figure c) on an 8$\times$H100 GPU cluster. Results show that: (a) Current inference systems fail to maintain high throughput at large batch sizes. (b) Throughput degradation is primarily caused by low KV cache hit rates, which increase end-to-end request latency. (c) ThunderAgent achieves higher throughput than prior inference systems by reducing KV-cache thrashing and managing the lifecycle of tool execution resources.
  • Figure 2: Demonstrations of the memory imbalance and tool resource management problems in current agentic inference systems. We evaluate vLLM + Kubernetes on OpenHands RL rollout using the GLM-4.6 model on SWE-Bench Lite with two 8$\times$H100 GPU nodes. The observations show: (a) Maximum memory imbalance can reach 51% in 90-minute rollout tests when using the vLLM KV-aware router. (b) Failure to garbage-collect tool execution environments gradually causes resource usage to exceed system capacity. (c) Average tool execution environment preparation time grows quickly as the number of parallel workflows increases.
  • Figure 3: An overview of ThunderAgent. We show the transitions between scheduling states and memory management. ThunderAgent queries the state of each data-parallel backend every $\Delta t$. Here, Backend #1 triggers thrashing, while Backend #3 is underutilized. The global waiting queue shared by all backends then pauses acting Program #2 and collects it back into the queue, while releasing reasoning Programs #6 and #9, to stop the KV-cache thrashing in Backend #1 and reduce the memory imbalance of Backend #3.
  • Figure 4: Serving Evaluation Results. ThunderAgent significantly outperforms vLLM and Continuum across three models, four agentic workflows, and three datasets. For workflows with predictable tool call times (e.g., a, b, d, e), ThunderAgent outperforms vLLM and Continuum by up to 2.43-3.56$\times$. For workflows that exhibit stochastic tool execution times (e.g., c, f), ThunderAgent still achieves the best throughput.
  • Figure 5: KV Cache Hit Rate Statistics. ThunderAgent achieves a near-optimal ($\approx$ 100%) hit rate with predictable tool call times (a, b, d, e), while dynamically trading hit rate for less idle caching with stochastic tool execution times (c, f). It also achieves a higher KV cache hit rate than vLLM and Continuum.
  • ...and 5 more figures
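
The rebalancing step described in the Figure 3 caption can be sketched as a single periodic tick. This is a simplified illustration under stated assumptions: thrashing is approximated as memory use exceeding capacity, acting programs (those waiting on a tool) are the ones paused, and queued reasoning programs are released to the backend with the most free memory. The names `Prog`, `Backend`, and `tick` are hypothetical and do not come from the ThunderAgent implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Prog:
    pid: int
    tokens: int  # KV-cache footprint in tokens
    state: str   # "reasoning" (needs the LLM) or "acting" (waiting on a tool)


@dataclass
class Backend:
    capacity: int
    running: list = field(default_factory=list)

    def used(self) -> int:
        return sum(p.tokens for p in self.running)


def tick(backends, queue):
    """One periodic check (every delta-t in the caption); a sketch only."""
    # 1) On overloaded backends, pause acting programs back into the queue.
    for b in backends:
        acting = sorted((p for p in b.running if p.state == "acting"),
                        key=lambda p: p.tokens)
        for p in acting:
            if b.used() <= b.capacity:
                break
            b.running.remove(p)
            queue.append(p)
    # 2) Release queued reasoning programs to the backend with most free memory.
    for _ in range(len(queue)):
        p = queue.popleft()
        target = max(backends, key=lambda b: b.capacity - b.used())
        if p.state == "reasoning" and p.tokens <= target.capacity - target.used():
            target.running.append(p)
        else:
            queue.append(p)  # stays queued until it can run


b1 = Backend(100, [Prog(1, 60, "reasoning"), Prog(2, 70, "acting")])
b3 = Backend(100, [Prog(3, 10, "reasoning")])
queue = deque([Prog(6, 40, "reasoning"), Prog(9, 30, "reasoning")])
tick([b1, b3], queue)
```

The toy input mirrors the caption: the overloaded backend gives up acting Program #2, and the underutilized backend receives reasoning Programs #6 and #9 from the shared queue.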

Theorems & Definitions (4)

  • Lemma 4.1: Quadratic Recomputation Cost
  • Definition 4.1: Eviction Optimization Problem
  • Theorem E.1: Admissible Time Decay Functions
  • proof