Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

TL;DR

Helix is a max-flow-based framework for serving large language models across heterogeneous GPU clusters and networks. By formulating model placement as a max-flow problem solved with MILP and by assigning each request its own pipeline, Helix jointly optimizes where model layers are placed and how individual requests are routed, achieving higher throughput and lower latency than existing heterogeneous-serving approaches in both single-cluster and geo-distributed deployments. The system is implemented on top of vLLM, alongside a dedicated MILP solver and a simulator, and is evaluated on LLaMA-30B/70B workloads across diverse hardware mixes, showing throughput gains of up to 3.3x and reductions in prompting and decoding latency of up to 66% and 24%, respectively. The work lays a foundation for scalable, network-aware LLM serving in heterogeneous data centers and across regions, with practical implications for cloud providers and large-scale AI deployments.

Abstract

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving in heterogeneous GPU clusters. The key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem on directed, weighted graphs, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs on heterogeneous GPUs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous clusters ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 3.3x and reduces prompting and decoding latency by up to 66% and 24%, respectively, compared to existing approaches. Helix is available at https://github.com/Thesys-lab/Helix-ASPLOS25.
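
To make the max-flow formulation concrete, the following is a minimal sketch, not the Helix implementation. It assumes a hypothetical 3-GPU cluster in which an A100 holds the first half of the model's layers and an L4 and a T4 each hold a replica of the second half. Each GPU is split into an `_in` and an `_out` vertex so that its compute throughput can be expressed as an edge capacity, next to the network edges. All node names and capacities (in tokens per second) are illustrative assumptions, and the solver is a plain Edmonds-Karp routine rather than anything taken from the paper.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow; `capacity` maps (u, v) -> tokens per second."""
    residual = defaultdict(int)
    neighbors = defaultdict(set)
    for (u, v), cap in capacity.items():
        residual[(u, v)] += cap
        neighbors[u].add(v)
        neighbors[v].add(u)  # reverse edge used by the residual graph

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left: flow is maximal

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        total += bottleneck


# Hypothetical capacities in tokens per second (assumed, not measured).
cluster = {
    ("source", "a100_in"): 1000,    # coordinator -> A100 network link
    ("a100_in", "a100_out"): 600,   # A100 compute throughput
    ("a100_out", "l4_in"): 400,     # A100 -> L4 network link
    ("a100_out", "t4_in"): 300,     # A100 -> T4 network link
    ("l4_in", "l4_out"): 250,       # L4 compute throughput
    ("t4_in", "t4_out"): 150,       # T4 compute throughput
    ("l4_out", "sink"): 1000,       # L4 -> coordinator network link
    ("t4_out", "sink"): 1000,       # T4 -> coordinator network link
}

throughput = max_flow(cluster, "source", "sink")
print(throughput)  # 400
```

In this toy cluster the A100 could process 600 tokens per second, but the two downstream GPUs together absorb only 250 + 150 = 400, so the max flow, and therefore the serving throughput of this placement, is 400 tokens per second. A different placement changes the graph and its capacities, which is why placement and scheduling have to be optimized together.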

Paper Structure (68 sections, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Examples of sub-optimal model placement and request scheduling. (a) The GPUs and network conditions in this example; the order of compute capacity is A100 > L4 > T4. (b) Model placement that partitions the model uniformly, then allocates devices to balance compute capacity. (c) Co-optimizing model partition and device placement to make compute capacity more balanced. (d) Co-optimizing model partition, device placement, and request scheduling in a network-aware way.
  • Figure 2: Graph abstraction of a 3-node cluster with a given model placement. Numbers on the edges of the flow graph represent their capacity, which is the number of tokens that can pass through the edge per second. The max flow between source and sink equals the max serving throughput of the cluster.
  • Figure 3: Helix overview. In Helix, the coordinator plans model placement as described in the MILP formulation section. Model placement only needs to run once for each cluster. When a new request arrives, the coordinator node runs the Helix scheduler to assign it a per-request pipeline and sends it to the first node in the pipeline. Each compute node in the pipeline performs inference on the request for the layers it is responsible for and sends the (output for the) request to the next node in the pipeline. When the last node in the pipeline finishes inference on its layers, it sends the output token for the request to the coordinator (Worker Finished). The coordinator then schedules generation of the next token for the request using the same pipeline.
  • Figure 4: Topology graph of a cluster, where each vertex is a compute node, and each edge is a valid network connection. Numbers over edges represent the flow over the network connection in the max flow solution. The pipelines used to schedule requests 1 and 2 are shown on the right.
  • Figure 5: Statistics of Azure Conversation dataset.
  • ...and 7 more figures
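
The captions of Figures 3 and 4 describe per-request pipelines derived from the max-flow solution. The sketch below illustrates one way such pipelines could be obtained: it decomposes a given edge flow into source-to-sink paths, then assigns each incoming request to the path whose share of requests lags furthest behind its share of the flow. The flow values, node names, and the selection rule are assumptions made for illustration; this is a stand-in and not the Helix scheduler itself.

```python
def decompose_flow(flow, source, sink):
    """Split an acyclic edge flow {(u, v): amount} into (path, amount) pairs."""
    remaining = {edge: amount for edge, amount in flow.items() if amount > 0}
    paths = []
    while True:
        # Walk from the source along edges that still carry flow.
        path, node = [source], source
        while node != sink:
            nxt = next((v for (u, v) in remaining if u == node), None)
            if nxt is None:
                break
            path.append(nxt)
            node = nxt
        if node != sink:
            return paths  # no flow-carrying path remains

        # Peel off the smallest flow along the path.
        edges = list(zip(path, path[1:]))
        amount = min(remaining[edge] for edge in edges)
        for edge in edges:
            remaining[edge] -= amount
            if remaining[edge] == 0:
                del remaining[edge]
        paths.append((path, amount))


def pick_pipeline(paths, assigned):
    """Pick the pipeline with the lowest requests-to-flow ratio so far."""
    index = min(range(len(paths)), key=lambda i: assigned[i] / paths[i][1])
    assigned[index] += 1
    return paths[index][0]


# Hypothetical max-flow solution in tokens per second (assumed values).
flow = {
    ("coordinator", "a100"): 400,
    ("a100", "l4"): 250,
    ("a100", "t4"): 150,
    ("l4", "sink"): 250,
    ("t4", "sink"): 150,
}

paths = decompose_flow(flow, "coordinator", "sink")
assigned = [0] * len(paths)
for request_id in range(8):
    pipeline = pick_pipeline(paths, assigned)
    print(request_id, " -> ".join(pipeline))
```

With these assumed values the flow splits into two pipelines carrying 250 and 150 tokens per second, and eight requests are distributed 5 to 3, matching the 250:150 ratio of the flow. Keeping each request on the pipeline it was first assigned to, as the Figure 3 caption describes, lets later tokens be generated on the same nodes that handled earlier ones.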