
Batch Query Processing and Optimization for Agentic Workflows

Junyi Shen, Noppanat Wadlom, Yao Lu

TL;DR

Halo addresses inefficiencies in agentic LLM workflows by modeling each workflow as a dependency-aware DAG and optimizing execution at the batch level across CPU and GPU resources. It introduces an epoch-based dynamic-programming scheduler that jointly decides placements, batching, and tool preparation, while a Processor enforces plan-adherent, pipelined execution with cross-workflow coalescing and KV-cache sharing. The system demonstrates up to 3.6x batch-inference speedups and 2.6x online throughput improvements across diverse benchmarks, with scheduling that is near-optimal relative to a MILP oracle and robust to scale and heterogeneity. Halo’s unified optimization across heterogeneous operators enables efficient data-analytic and decision-making workflows on a single machine, providing a practical path toward scalable agentic analytics.

Abstract

Large Language Models (LLMs) in agentic workflows combine multi-step reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. Existing LLM serving engines optimize individual calls in isolation, while multi-agent frameworks focus on orchestration without system-level performance planning. As a result, repeated prompts, overlapping contexts, and fragmented CPU-GPU execution create substantial redundancy and poor hardware utilization, especially in batch analytics scenarios. We introduce Halo, a system that brings batch query processing and optimization into agentic LLM workflows. Halo represents each workflow as a structured query plan DAG and constructs a consolidated graph for batched queries that exposes shared computation. Guided by a cost model that jointly considers heterogeneous resource constraints, prefill and decode costs, cache reuse, and GPU placement, Halo performs plan-level optimization to minimize redundant execution. The Processor integrates adaptive batching, KV-cache sharing and migration, along with fine-grained CPU-GPU pipelining to maximize holistic hardware efficiency. Evaluation across six benchmarks shows that Halo achieves up to 3.6x speedup for batch inference and 2.6x throughput improvement under online serving, scaling to workloads of thousands of queries and complex graphs. These gains are achieved without compromising output quality. By unifying query optimization with heterogeneous LLM serving, Halo enables efficient agentic workflows in data analytics and decision-making applications.
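To make the idea of a consolidated graph concrete, the following is a minimal sketch of how per-workflow DAGs might be merged so that identical operators are executed once for a whole batch. It is an illustration under stated assumptions, not Halo's implementation: the `Node` type, the `consolidate` function, and the exact-match merging rule (same operator, same payload, same upstream results) are all hypothetical, and the example prompts are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Node:
    """One operator in a workflow DAG (hypothetical structure, for illustration)."""
    name: str                      # local name inside a single workflow
    op: str                        # operator kind, e.g. "llm" or "tool"
    payload: str                   # prompt text or tool invocation
    deps: Tuple[str, ...] = ()     # names of upstream nodes in the same workflow


def consolidate(workflows: Dict[str, List[Node]]):
    """Merge workflow DAGs, collapsing nodes whose operator, payload,
    and upstream signatures all match exactly (an assumed merging rule)."""
    merged: Dict[tuple, Node] = {}      # signature -> representative node
    users: Dict[tuple, Set[str]] = {}   # signature -> workflows that need it

    for wf_id, nodes in workflows.items():
        by_name = {n.name: n for n in nodes}
        cache: Dict[str, tuple] = {}

        def signature(name: str) -> tuple:
            # A node's signature includes the signatures of its inputs, so two
            # nodes merge only if their entire upstream subgraphs are identical.
            if name not in cache:
                n = by_name[name]
                cache[name] = (n.op, n.payload, tuple(signature(d) for d in n.deps))
            return cache[name]

        for n in nodes:
            s = signature(n.name)
            merged.setdefault(s, n)
            users.setdefault(s, set()).add(wf_id)

    return merged, users


# Two invented workflows that share a data-loading step and a summarization step.
workflows = {
    "w1": [
        Node("load", "tool", "read revenue.csv"),
        Node("summarize", "llm", "Summarize the revenue table.", ("load",)),
        Node("advise", "llm", "Recommend a pricing change.", ("summarize",)),
    ],
    "w2": [
        Node("load", "tool", "read revenue.csv"),
        Node("summarize", "llm", "Summarize the revenue table.", ("load",)),
        Node("forecast", "llm", "Forecast next quarter.", ("summarize",)),
    ],
}

merged, users = consolidate(workflows)
total_nodes = sum(len(nodes) for nodes in workflows.values())
shared = [sig for sig, wfs in users.items() if len(wfs) > 1]
print(f"{total_nodes} nodes before consolidation, {len(merged)} after, {len(shared)} shared")
```

Exact-match merging is the simplest possible rule. The abstract's mention of overlapping contexts and KV-cache sharing suggests that partial overlap, such as a common prompt prefix, can also be exploited, which this sketch does not attempt to model.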

Paper Structure

This paper contains 17 sections, 9 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: An example agentic workflow in which multiple LLM agents collaborate to analyze revenue data and provide decision support for businesses.
  • Figure 2: Complex agentic workflows involve modular, collaborative, and adaptive processes.
  • Figure 3: Halo system overview.
  • Figure 4: Halo's Processor, which maps an execution plan to coordinated execution across heterogeneous CPU and GPU workers.
  • Figure 5: Workflow example (W6).
  • ...and 6 more figures