FlowMesh: A Service Fabric for Composable LLM Workflows

Junyi Shen, Noppanat Wadlom, Lingfeng Zhou, Dequan Wang, Xu Miao, Lei Fang, Yao Lu

TL;DR

FlowMesh introduces a DAG-based, multi-tenant service fabric for composable LLM workflows that decomposes tasks into fine-grained operators with lineage, enabling cross-tenant deduplication and continuous batching across heterogeneous GPUs. A single global control plane jointly optimizes placement, batching, and data locality, while a stateless data plane automatically scales with demand and relies on a content-addressable store for provenance and reuse. The system is implemented on both Kubernetes and Vast.ai, and evaluated on representative post-training workflows (SFT, RLHF/RLAIF, PPO, DPO) showing up to 3.8x cost reductions and 2.0x energy savings with comparable or better latency, robustness to failures, and scalability. The work unifies workflow graphs, resource heterogeneity, and dynamic provisioning into a portable, fault-tolerant fabric, enabling practical gains in efficiency for modern AI development pipelines.
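To illustrate how a content-addressable store can enable cross-tenant deduplication, the following is a minimal sketch and not FlowMesh's actual implementation. It assumes an operator's result is fully determined by its type, its parameters, and the keys of its inputs; the names `operator_key` and `ContentStore` are hypothetical.

```python
import hashlib
import json


def operator_key(op_type, params, input_keys):
    """Derive a deterministic key from an operator's type, parameters, and input keys.

    Assumption: these three fields fully determine the operator's output.
    """
    payload = json.dumps(
        {"op": op_type, "params": params, "inputs": list(input_keys)},
        sort_keys=True,            # make the key independent of dict ordering
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ContentStore:
    """Hypothetical in-memory stand-in for a content-addressable result store."""

    def __init__(self):
        self._results = {}
        self.executions = 0        # counts how often work was actually performed

    def run(self, op_type, params, input_keys, compute):
        key = operator_key(op_type, params, input_keys)
        if key not in self._results:
            # First request for this content: do the work and record it.
            self._results[key] = compute()
            self.executions += 1
        # Later identical requests, from any tenant, reuse the stored result.
        return key, self._results[key]


store = ContentStore()
key_a, out_a = store.run("tokenize", {"vocab": "v1"}, ["dataset-abc"], lambda: [1, 2, 3])
key_b, out_b = store.run("tokenize", {"vocab": "v1"}, ["dataset-abc"], lambda: [9, 9, 9])
```

In this sketch the second request never runs its `compute` callback, because its key matches the first. Because each key embeds the keys of its inputs, the chain of keys also acts as a simple lineage record.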

Abstract

AI deployment increasingly resembles a pipeline of data transformation, fine-tuning, and agent interactions rather than a monolithic LLM job; recent examples include RLHF/RLAIF training and agentic workflows. To cope with this shift, we propose FlowMesh, a multi-tenant service fabric that executes and optimizes these workloads as one shared service instead of isolated pipelines. It decomposes workflows into fine-grained operators with recorded lineage, enabling de-duplication of work across users and batching requests on the same hardware while preserving per-workflow provenance. A global control plane maintains a cluster-wide pool of ready operators and uses a single utility function to pick both the batch and the worker, balancing throughput, cost, and data locality on heterogeneous GPUs. The data plane is an elastic fleet of stateless workers backed by a content-addressable store, enabling rapid, automatic scale-out, safe retry after preemption, and portability across managed clusters such as Kubernetes and geo-distributed GPU marketplaces such as Vast.ai. Compared with baseline solutions, FlowMesh achieves up to 3.8x cost reduction and 2.0x lower energy usage, provides a similar or better latency profile, and remains efficient under dynamic and failure-prone conditions.
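The joint choice of batch and worker described above can be sketched as follows. This is an illustrative assumption, not the paper's utility function: the scoring terms, the weights `w_cost` and `w_locality`, and the types `ReadyOp` and `Worker` are all hypothetical, and batching is simplified to grouping ready operators that share a model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadyOp:
    op_id: str
    model: str         # assumption: operators sharing a model can be batched
    input_site: str    # where the operator's inputs currently live


@dataclass(frozen=True)
class Worker:
    name: str
    site: str
    ops_per_sec: float       # hypothetical throughput estimate
    dollars_per_sec: float   # hypothetical price
    max_batch: int


def utility(batch, worker, w_cost=1.0, w_locality=0.5):
    """Score one (batch, worker) pair: work done minus cost and remote-data penalties."""
    cost = worker.dollars_per_sec * len(batch) / worker.ops_per_sec
    remote_inputs = sum(1 for op in batch if op.input_site != worker.site)
    return len(batch) - w_cost * cost - w_locality * remote_inputs


def pick_batch_and_worker(ready, workers):
    """Return the (batch, worker) pair with the highest utility, or None."""
    by_model = defaultdict(list)
    for op in ready:
        by_model[op.model].append(op)
    # Enumerate every candidate batch on every worker and score them jointly.
    candidates = [
        (ops[: w.max_batch], w) for ops in by_model.values() for w in workers
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: utility(c[0], c[1]))


ready = [
    ReadyOp("a", "policy", "us"),
    ReadyOp("b", "policy", "us"),
    ReadyOp("c", "policy", "us"),
    ReadyOp("d", "reward", "eu"),
]
workers = [
    Worker("gpu-us", "us", ops_per_sec=10.0, dollars_per_sec=0.1, max_batch=4),
    Worker("gpu-eu", "eu", ops_per_sec=10.0, dollars_per_sec=0.1, max_batch=4),
]
batch, worker = pick_batch_and_worker(ready, workers)
```

Here the three `policy` operators are batched onto `gpu-us`: that pairing completes the most work and incurs no remote-input penalty. Scoring batch and worker together, rather than forming batches first and placing them afterwards, is the property the sketch is meant to show.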

Paper Structure

This paper contains 15 sections, 1 equation, 9 figures, 4 tables.

Figures (9)

  • Figure 1: FlowMesh decomposes multi-stage LLM workflows into fine-grained tasks and dispatches them among distributed workers.
  • Figure 2: RLHF/RLAIF through reward and feedback.
  • Figure 3: An agentic workflow orchestrates reasoning, tool use, and reflection.
  • Figure 4: System overview of FlowMesh.
  • Figure 5: FlowMesh compared with baselines. Left: Total cost and energy consumption under identical workload conditions. Right: Cost–Delay Product (CDP) and Energy–Delay Product (EDP).
  • ...and 4 more figures