ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System
Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, Simran Arora
TL;DR
ThunderAgent tackles throughput and memory inefficiencies in large-scale agentic inference by introducing a program-centric abstraction that unifies KV-cache management and tool environments. The core approach combines a program-aware scheduler with a global waiting queue and lifecycle-aware tool resource management, underpinned by a Space-Time Product (STP) cost model where $\text{Cost}_{\text{total}} \approx \text{Cost}_{\text{decode}} + \text{Cost}_{\text{prefill}} + \text{Cost}_{\text{recompute}} + \text{Cost}_{\text{unused}} + \text{Cost}_{\text{caching}}$ and $\text{Cost}_{\text{recompute}} \propto c_i^2$. ThunderAgent detects and mitigates KV-cache thrashing via periodic memory checks and a Shortest-First Eviction strategy, while a global program-aware waiting queue addresses cross-node memory imbalance; asynchronous environment preparation and hook-based garbage collection reduce tool-resource overhead and leakage. Empirical results show 1.5–3.6x throughput gains in serving, 1.8–3.9x in RL rollout, and up to 4.2x disk-memory savings across diverse agent workloads, validating the practicality and scalability of end-to-end program-aware optimization. The open-source release at https://github.com/Agentic-Kinetics/ThunderAgent enables reproducibility and further development of program-centric agent inference systems.
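To make the link between the quadratic recompute term and Shortest-First Eviction concrete, the following is a minimal sketch, not ThunderAgent's actual implementation. It assumes that $c_i$ denotes the context length in tokens of program $i$ (the summary does not define it), that memory pressure is measured in resident KV-cache tokens, and that every program is eligible for eviction. The names `Program`, `recompute_cost`, and `shortest_first_eviction` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Program:
    """Hypothetical record for one agentic program (not ThunderAgent's own type)."""
    program_id: str
    context_tokens: int  # assumed meaning of c_i: tokens with resident KV cache


def recompute_cost(context_tokens: int) -> int:
    """Relative recompute cost, using only the stated proportionality to c_i^2."""
    return context_tokens ** 2


def shortest_first_eviction(programs, capacity_tokens):
    """Evict the shortest-context programs until the rest fit in capacity_tokens.

    Returns (kept, evicted). Ordering by ascending context length means the
    programs that are cheapest to recompute are the first to lose their cache.
    """
    kept = sorted(programs, key=lambda p: p.context_tokens)
    used = sum(p.context_tokens for p in kept)
    evicted = []
    while kept and used > capacity_tokens:
        victim = kept.pop(0)  # shortest remaining context
        used -= victim.context_tokens
        evicted.append(victim)
    return kept, evicted


if __name__ == "__main__":
    programs = [Program("a", 1000), Program("b", 4000), Program("c", 8000)]
    kept, evicted = shortest_first_eviction(programs, capacity_tokens=10_000)
    print([p.program_id for p in evicted])                        # ['a', 'b']
    print(sum(recompute_cost(p.context_tokens) for p in evicted))  # 17000000
    print(recompute_cost(8000))                                    # 64000000
```

In this example, evicting the two shortest programs brings usage from 13,000 to 8,000 tokens at a relative recompute cost of 17,000,000, whereas evicting the single longest program would also fit within capacity but would cost 64,000,000 to recompute. This illustrates why a quadratic recompute term favors evicting short contexts; it is not a claim that the greedy order is optimal for every workload.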
Abstract
Large language models (LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV caches and tool execution environments. To address these challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Building on this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5–3.6x throughput improvements in serving, 1.8–3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the full ThunderAgent implementation at https://github.com/Agentic-Kinetics/ThunderAgent.
