Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective

Noppanat Wadlom, Junyi Shen, Yao Lu

Abstract

Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent serving from a data systems perspective and introduces Helium, a workflow-aware serving framework that models agentic workloads as query plans and treats LLM invocations as first-class operators. Helium integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows. Through these techniques, Helium bridges classic query optimization principles with LLM serving, achieving up to 1.56x speedup over state-of-the-art agent serving systems on various workloads. Our results demonstrate that end-to-end optimization across workflows is essential for scalable and efficient LLM-based agents.
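To make the redundancy claim concrete, the following is a minimal sketch, not Helium's implementation, of how overlapping prompt prefixes across the calls of one workflow could be measured with a token-level trie. The names `PrefixCache` and `reuse_ratio` are hypothetical, whitespace splitting stands in for a real tokenizer, and the cache is assumed to be unbounded, so call order does not affect the result here.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps a token to the child node reached by appending that token.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token-level prefix trie (hypothetical stand-in for a KV-cache index)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Insert a prompt; return how many leading tokens were already cached."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # First miss: this token and everything after it is new work.
                child = TrieNode()
                node.children[tok] = child
            else:
                reused += 1
            node = child
        return reused


def reuse_ratio(prompts):
    """Fraction of prompt tokens whose prefix state could be reused across calls."""
    cache = PrefixCache()
    total = 0
    reused = 0
    for prompt in prompts:
        tokens = prompt.split()  # assumption: whitespace tokens, not a real tokenizer
        reused += cache.insert(tokens)
        total += len(tokens)
    return reused / total if total else 0.0


if __name__ == "__main__":
    # Three calls that share a system prompt and part of the context.
    calls = ["sys a b q1", "sys a b q2", "sys a c q3"]
    print(f"reusable fraction: {reuse_ratio(calls):.2f}")
```

A serving system that looks at each call in isolation has no view of this overlap; a workflow-aware one can see the shared prefixes before any call runs.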

Paper Structure

This paper contains 20 sections, 1 theorem, 6 equations, 13 figures, 7 tables, and 2 algorithms.

Key Result

Theorem 1

The time complexity of the scheduling algorithm (Algorithm 1) is $O(|V_{int}| \cdot c_{max}^3 + |E'| \cdot d)$.

Figures (13)

  • Figure 1: Three disparities between traditional SQL pipelines and agentic workflows with LLM as operators.
  • Figure 2: Each representative agentic workflow demonstrates a primitive pattern in agent interactions.
  • Figure 3: Overview of Helium's architecture.
  • Figure 4: A workflow DAG (top) and the corresponding templated radix tree with cache-aware schedule (bottom).
  • Figure 5: Normalized end-to-end latency of Helium and baselines, excluding vLLM, across representative workflows and datasets with Qwen3-8B. Values are normalized within each workload so that 1.0 equals the slowest system (lower is better).
  • ...and 8 more figures

Theorems & Definitions (1)

  • Theorem 1