AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU

Yuning Zhang, Yan Yan, Nan Yang, Dong Yuan

TL;DR

AgentServe is a single-GPU serving system that keeps multi-agent execution stable when long prefills and short decodes contend for the same GPU. It does so by isolating prefills from decodes, applying dynamic budgeting to resume prefills, and allocating GPU resources through pre-established CUDA Green Context slots with adaptive control.

Abstract

Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput while requests from multiple AI agents arrive concurrently. Recent deployments highlight a shift toward running small language models (SLMs) locally on consumer-grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head-of-line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into three phases: cold prefills, which process long system prompts; resume prefills, which append tool outputs to cached contexts; and short decodes, which are latency-critical. This mix intensifies contention compared to conventional chatbot serving. We present AgentServe, a single-GPU serving system that ensures stable multi-agent execution under such conditions by isolating prefills from decodes, applying dynamic budgeting to resume prefills, and allocating GPU resources through pre-established CUDA Green Context slots with adaptive control. Evaluation results show that AgentServe significantly improves latency stability while sustaining competitive throughput, achieving up to 2.8x TTFT improvement and 2.7x TPOT improvement over state-of-the-art baselines across different settings.
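
To make the three-phase split concrete, the sketch below classifies a request as a cold prefill, a resume prefill, or a decode, based on how many tokens are new and how many are already cached. It is a minimal illustration, not AgentServe's implementation: the names `Request`, `Phase`, `classify`, and `split_resume_prefill` are hypothetical. The fixed-size chunking is only a stand-in for the dynamic budgeting the abstract mentions, whose actual policy is not described here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Phase(Enum):
    COLD_PREFILL = "cold_prefill"
    RESUME_PREFILL = "resume_prefill"
    DECODE = "decode"


@dataclass
class Request:
    # Hypothetical fields, chosen for illustration only.
    new_tokens: int     # prompt tokens that still need a prefill pass
    cached_tokens: int  # tokens already held in the prefix (KV) cache


def classify(req: Request) -> Phase:
    """Map a request to one of the three phases named in the abstract."""
    if req.new_tokens == 0:
        return Phase.DECODE          # nothing to prefill; emit tokens
    if req.cached_tokens == 0:
        return Phase.COLD_PREFILL    # no cached context, e.g. a long system prompt
    return Phase.RESUME_PREFILL      # tool output appended to a cached context


def split_resume_prefill(new_tokens: int, budget: int) -> List[int]:
    """Split a resume prefill into chunks of at most `budget` tokens.

    Assumption: a fixed per-step token budget stands in for the paper's
    dynamic budgeting, which would adjust this value at run time.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    chunks = []
    remaining = new_tokens
    while remaining > 0:
        step = min(budget, remaining)
        chunks.append(step)
        remaining -= step
    return chunks
```

With this split, a scheduler could route `DECODE` work to a latency-protected path and bound how much prefill work is admitted between decode steps, which is the kind of isolation the abstract describes.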

Paper Structure (17 sections, 21 equations, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of cold prefill, resume prefill, and short decode cycles in agent workloads with prefix caching.
  • Figure 2: Time per output token (TPOT) of Qwen2.5-7B and Qwen2.5-3B with three concurrent agents on a single RTX A5000 GPU. Cold prefills introduce long kernels that block concurrent decodes, causing visible spikes in token emission latency.
  • Figure 3: Normalized throughput versus SM share for decode, cold prefill, and resume prefill on Qwen2.5-7B and Qwen2.5-3B (RTX 5090). Decode throughput rises quickly at low SM shares and saturates earlier, while prefill throughput increases more gradually.
  • Figure 4: System architecture of AgentServe. The Application Layer connects users and tools, the Orchestration Layer manages requests and resource allocation, and the Execution Layer enforces prefill–decode disaggregation with CUDA Green Contexts and memory management to ensure decode responsiveness in multiple agent settings.
  • Figure 5: Latency and throughput under different model–device settings. AgentServe consistently achieves the lowest TTFT and TPOT across concurrency levels, while also sustaining competitive throughput compared with baselines. Results are shown for Qwen2.5-3B/7B and Llama-3-8B on both A5000 and 5090 GPUs.
  • ...and 2 more figures
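
The captions above report TTFT (time to first token) and TPOT (time per output token). The sketch below shows how these metrics are commonly computed from per-token emission timestamps. It follows the usual definitions and is an assumption about measurement, not the paper's evaluation harness; the function name `ttft_and_tpot` and its parameters are hypothetical.

```python
from typing import List, Tuple


def ttft_and_tpot(arrival: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    arrival:     time the request was submitted
    token_times: emission time of each output token, in order
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    # TTFT: delay from request arrival to the first emitted token.
    ttft = token_times[0] - arrival
    if len(token_times) == 1:
        return ttft, 0.0  # no inter-token gap to average
    # TPOT: mean gap between consecutive tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

A mean TPOT hides spikes like those described for Figure 2, so a stability analysis would also look at the distribution of the individual inter-token gaps, for example their tail percentiles.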