MoLink: Distributed and Efficient Serving Framework for Large Models

Lewei Jin, Yongqi Chen, Kui Zhang, Yifan Zhuo, Yi Gao, Bowei Yang, Zhengong Cai, Wei Dong

TL;DR

MoLink tackles the high cost of serving large language models by exploiting consumer-grade GPUs in heterogeneous, bandwidth-constrained environments. It introduces a dual-node master-worker architecture with a Kubernetes/WSL-friendly control plane, plus dynamic micro-batch scheduling and chunk-based prefill transmission to reduce pipeline bubbles and network contention. Empirical results show up to 458% throughput improvements and up to 151% better cost-profit margins over state-of-the-art baselines, and support for 18 mainstream LLM architectures across Windows, Linux, and containerized VMs. The framework promises low-cost, scalable LLM serving in decentralized settings, expanding accessibility for research and smaller deployments.

Abstract

Large language models represent a groundbreaking shift in generative AI, but these advances come with a significant challenge: the high cost of model serving. Consumer-grade GPUs are a more affordable alternative and offer an opportunity for more cost-efficient LLM serving. However, serving LLMs efficiently on consumer-grade GPUs is non-trivial, mainly because of two challenges: 1) these GPUs are often deployed under limited network conditions; 2) their host systems are often heterogeneous. To address these challenges, we present MoLink, a distributed serving system for large models. It incorporates several key techniques that enable efficient LLM serving on heterogeneous and weakly connected consumer-grade GPUs. Our experiments demonstrate throughput improvements of up to 458% and cost-profit margin improvements of up to 151% compared to state-of-the-art systems. MoLink allows users on Windows, Linux, and containerized VMs to integrate GPUs with just a few lines of code over Ethernet or public networks. It currently supports 18 mainstream open-source LLM architectures. The source code is publicly available at https://github.com/oldcpple/MoLink.

Paper Structure

This paper contains 21 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of MoLink.
  • Figure 2: The design of master node.
  • Figure 3: The design of worker node.
  • Figure 4: (a) Pipeline Bubble. (b) Dynamic Micro-batch scheduling.
  • Figure 5: A case of transmission competition between prefill and decode for LLaMA-30B running on RTX 4090(s) linked with 100 Mbps bandwidth. The prompt length is 1000 and the batch size is 4.
  • ...and 2 more figures
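
To give a rough sense of scale for the setting in the Figure 5 caption, the following back-of-the-envelope sketch estimates how long one prefill and one decode activation transfer would take across a single pipeline-stage boundary. It is not taken from the paper. It assumes that the prompt length is counted in tokens, that activations are sent as 16-bit values, that the hidden size is 6656 (the published LLaMA-30B dimension), and that latency and protocol overhead are negligible. The helper names are hypothetical.

```python
# Back-of-the-envelope estimate, not the authors' measurement.
# Assumptions (not stated in the source): fp16 activations, prompt length
# counted in tokens, one pipeline-stage boundary, no latency or overhead.

def activation_bytes(num_tokens: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Size of the hidden-state tensor passed between two pipeline stages."""
    return num_tokens * hidden_size * bytes_per_value


def transfer_seconds(num_bytes: int, bandwidth_mbps: float) -> float:
    """Ideal transfer time over a link of the given bandwidth (megabits/s)."""
    return num_bytes * 8 / (bandwidth_mbps * 1_000_000)


HIDDEN_SIZE = 6656      # LLaMA-30B hidden dimension
BANDWIDTH_MBPS = 100    # from the Figure 5 caption
BATCH_SIZE = 4          # from the Figure 5 caption
PROMPT_LENGTH = 1000    # from the Figure 5 caption, assumed to be tokens

# Prefill sends activations for every prompt token of every request.
prefill_bytes = activation_bytes(BATCH_SIZE * PROMPT_LENGTH, HIDDEN_SIZE)
# A decode step sends activations for one new token per request.
decode_bytes = activation_bytes(BATCH_SIZE, HIDDEN_SIZE)

prefill_seconds = transfer_seconds(prefill_bytes, BANDWIDTH_MBPS)
decode_seconds = transfer_seconds(decode_bytes, BANDWIDTH_MBPS)

if __name__ == "__main__":
    print(f"prefill: {prefill_bytes / 1e6:.1f} MB, {prefill_seconds:.2f} s")
    print(f"decode:  {decode_bytes / 1e3:.1f} kB, {decode_seconds * 1e3:.2f} ms")
```

Under these assumptions a single prefill transfer occupies the link for seconds, while a decode transfer needs only milliseconds. A decode step queued behind an unchunked prefill transfer would therefore wait far longer than its own transmission time, which is the kind of contention that chunk-based prefill transmission aims to reduce.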