Parallax: Efficient LLM Inference Service over Decentralized Environment

Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, Binhang Yuan

TL;DR

Parallax tackles the challenge of running LLM inference on decentralized, heterogeneous volunteer GPUs by introducing a two-phase scheduling framework. Phase 1 uses dynamic programming with region-based heuristics and water-filling to allocate model layers across GPUs, balancing latency and throughput under memory and bandwidth constraints. Phase 2 treats the pre-allocated layers as a DAG and, using a live performance map stored in a distributed hash table, performs a per-request DP to assemble the optimal GPU pipeline chain that minimizes end-to-end latency. Empirical results show substantial improvements in both latency and throughput over a decentralized baseline, validating that careful placement and online chain selection enable affordable, scalable LLM inference on volunteer compute resources.
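The Phase 2 idea summarized above can be illustrated with a minimal Python sketch: each GPU holds a contiguous layer range, a stage can follow another when their ranges meet, and a dynamic program picks the chain with the lowest total latency. This is a sketch under stated assumptions, not the authors' implementation. The `Stage` class, the `select_chain` function, `default_link_ms`, and all latency numbers are hypothetical, and a plain dictionary stands in for the live performance map that Parallax keeps in a distributed hash table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    gpu: str           # hypothetical GPU identifier (assumed unique per stage)
    start: int         # first layer held (inclusive)
    end: int           # one past the last layer held (exclusive)
    compute_ms: float  # assumed per-request compute latency of this stage


def select_chain(stages, num_layers, link_ms, default_link_ms=50.0):
    """Return (total_latency_ms, [gpu, ...]) for the cheapest chain, or None."""
    # best[gpu] = (lowest latency of any chain ending at this stage, predecessor gpu)
    best = {}
    # Visiting stages by start layer guarantees predecessors are solved first.
    for s in sorted(stages, key=lambda st: st.start):
        if s.start == 0:
            best[s.gpu] = (s.compute_ms, None)
            continue
        options = [
            (best[p.gpu][0]
             + link_ms.get((p.gpu, s.gpu), default_link_ms)
             + s.compute_ms, p.gpu)
            for p in stages
            if p.end == s.start and p.gpu in best
        ]
        if options:
            best[s.gpu] = min(options)
    # Only stages that finish the model can terminate a chain.
    finals = [(best[s.gpu][0], s.gpu) for s in stages
              if s.end == num_layers and s.gpu in best]
    if not finals:
        return None
    total, gpu = min(finals)
    chain = []
    while gpu is not None:  # walk predecessor links back to the first stage
        chain.append(gpu)
        gpu = best[gpu][1]
    return total, chain[::-1]


# Illustrative numbers only: two replicas of a 4-layer model, split in half.
stages = [
    Stage("A", 0, 2, 10.0), Stage("B", 0, 2, 6.0),
    Stage("C", 2, 4, 8.0), Stage("D", 2, 4, 5.0),
]
link_ms = {("A", "C"): 5.0, ("A", "D"): 40.0, ("B", "C"): 30.0, ("B", "D"): 20.0}
result = select_chain(stages, 4, link_ms)
print(result)
```

In this toy setting the chain through A and C wins even though B and D compute faster, because the link latencies dominate. That is the kind of trade-off a per-request selection over live measurements is meant to capture.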

Abstract

Deploying a large language model (LLM) inference service remains costly because centralized serving depends on specialized GPU clusters and high-bandwidth interconnects in datacenters. An appealing alternative is to leverage collaborative, decentralized GPU pools. However, heterogeneous GPU hardware, limited network bandwidth between nodes, and potentially dynamic availability make efficient scheduling the central challenge in this scenario. In this paper, we present Parallax, a decentralized LLM serving system that turns a pool of heterogeneous GPUs into an efficient inference platform via a two-phase scheduler. Parallax decomposes planning into (i) model allocation, which places the layers of each replica across diverse GPUs to jointly optimize latency and throughput under memory and link-bandwidth constraints, and (ii) request-time GPU pipeline selection, which stitches layers from different replicas into end-to-end execution chains that balance load and adapt to current conditions. We implement Parallax and evaluate it on open-source LLMs deployed over real volunteer nodes. Parallax consistently reduces latency and increases throughput relative to decentralized baselines, demonstrating that principled scheduling can make volunteer compute a practical, affordable substrate for LLM inference. GitHub repository: https://github.com/GradientHQ/parallax.
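To make the model-allocation step more concrete, the following Python sketch shows a generic capped water-filling split of layers across the GPUs of one pipeline: each GPU receives layers in proportion to its speed until it reaches its memory cap, and the remainder flows to the other GPUs. This is only an illustration of the water-filling idea. It is not the paper's Phase 1 algorithm, which also involves dynamic programming and region-based heuristics. The `water_fill_layers` function, the speed scores, and the per-GPU layer caps are all assumptions made for the example.

```python
def water_fill_layers(num_layers, speeds, caps):
    """Split num_layers across GPUs proportionally to speed, capped by memory.

    speeds: assumed relative throughput score per GPU (positive floats).
    caps:   assumed maximum number of layers each GPU can hold (ints).
    """
    if sum(caps) < num_layers:
        raise ValueError("pool cannot hold the whole model")
    # Binary-search the water level so that sum(min(cap, level * speed)) == num_layers.
    lo, hi = 0.0, max(c / s for c, s in zip(caps, speeds))
    for _ in range(100):
        mid = (lo + hi) / 2
        if sum(min(c, mid * s) for c, s in zip(caps, speeds)) < num_layers:
            lo = mid
        else:
            hi = mid
    shares = [min(c, hi * s) for c, s in zip(caps, speeds)]
    # Round down, then hand out leftover layers by largest fractional remainder.
    alloc = [int(x) for x in shares]
    remaining = num_layers - sum(alloc)
    order = sorted(range(len(alloc)), key=lambda i: shares[i] - alloc[i], reverse=True)
    while remaining > 0:
        for i in order:
            if remaining == 0:
                break
            if alloc[i] < caps[i]:  # never exceed a GPU's memory cap
                alloc[i] += 1
                remaining -= 1
    return alloc


# Illustrative numbers only: one fast GPU with little memory, two slower ones.
allocation = water_fill_layers(32, speeds=[4.0, 2.0, 2.0], caps=[12, 16, 16])
print(allocation)
```

With these numbers, a purely proportional split would give the fast GPU 16 layers, but its cap of 12 binds, so the remaining 20 layers are shared equally by the two slower GPUs.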

Paper Structure

This paper contains 12 sections, 5 figures.

Figures (5)

  • Figure 1: Example of the first phase model allocation among heterogeneous GPU types across different geographic regions.
  • Figure 2: Example of the second phase GPU pipeline chain selection among GPUs (pipeline stages).
  • Figure 3: End-to-end latency comparison between Parallax and HexGen across different models, traces, and request arrival rates.
  • Figure 4: End-to-end throughput comparison between Parallax and HexGen across different models, traces, and request arrival rates.
  • Figure 5: Phase-1 and phase-2 algorithm running time when scaling from smaller clusters (e.g., 4 GPUs) to larger clusters (e.g., 256 GPUs).