
Enabling Elastic Model Serving with MultiWorld

Myungjin Lee, Akshay Jajoo, Ramana Rao Kompella

TL;DR

The paper addresses the rigidity of collective communication libraries (CCLs) such as NCCL when they are used for elastic, fault-tolerant serving of trillion-parameter models. It introduces MultiWorld, which allows a worker to belong to multiple worlds, enabling fine-grained fault isolation and online scaling at partition boundaries. The architecture comprises a world manager, a world communicator, and a watchdog, implemented in about 600 lines on top of PyTorch. The evaluation reports small throughput overheads (1.4–4.3%) in most tested scenarios while supporting workers that join and leave dynamically. The framework is a step toward elastic, high-availability inference on NCCL-based deployments, with less resource waste and better resilience for large-scale models.

Abstract

Machine learning models have been growing exponentially in parameter count over the past few years, and we are now seeing the rise of trillion-parameter models. Such large models cannot fit into a single GPU and thus require partitioned deployment across GPUs and even hosts. A high-performance collective communication library (CCL) such as NCCL is essential to fully utilize expensive GPU resources. However, a CCL is not a great fit for inference. Unlike training, for which a fixed amount of GPU resources is used for fixed workloads (e.g., input datasets), inference workloads can change dynamically over time. Failures at serving time can also directly impact individual users' experiences. In contrast, workers in a CCL process group share a single fault domain, and the process group cannot grow as the workload increases. The gap between the unique characteristics of model serving and the nature of CCLs makes it hard to serve large models elastically. To bridge the gap, we propose MultiWorld, which enables fault tolerance and online scaling at the granularity of workers for model serving. Our evaluation shows that enabling these new functionalities incurs small overheads (1.4–4.3% throughput loss) in most of the scenarios we tested.

Paper Structure (14 sections, 7 figures)


Figures (7)

  • Figure 1: Throughput for tensor forwarding via Kafka.
  • Figure 2: Illustration of MultiWorld's elasticity. (a) A serving pipeline has three stages where the middle stage is replicated. (b) In case of a worker failure (here P3), worlds containing the failed worker are removed; and the remaining workers continue to work. (c) Online instantiation not only allows fault recovery without restarting all other workers, but it also enables online scaling.
  • Figure 3: MultiWorld Architecture.
  • Figure 4: Fault tolerance of MultiWorld. Across two cases (single world and MultiWorld), a worker gets terminated after sending the 10th tensor. In the single world case (left), the other worker stops working once it detects the failure. In case of MultiWorld (right), the other worker continues to operate successfully.
  • Figure 5: Adding a worker dynamically.
  • ...and 2 more figures
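
The elasticity scenario in the Figure 2 caption can be sketched as plain membership bookkeeping. The following is a minimal, hypothetical illustration in pure Python, not the authors' implementation: the `WorldRegistry` class, the worker names P1 to P5, the world names W1 to W6, and the choice of one two-worker world per pipeline edge are all assumptions made for this example. It performs no GPU communication. It only shows that removing the worlds containing a failed worker leaves the other worlds untouched, and that new worlds can be added while the rest keep running.

```python
from dataclasses import dataclass, field


@dataclass
class WorldRegistry:
    """Toy bookkeeping of which workers belong to which worlds (hypothetical)."""

    worlds: dict[str, set[str]] = field(default_factory=dict)

    def add_world(self, name: str, members: set[str]) -> None:
        """Register a new world; existing worlds are not modified."""
        if name in self.worlds:
            raise ValueError(f"world {name!r} already exists")
        self.worlds[name] = set(members)

    def worlds_of(self, worker: str) -> set[str]:
        """Return the names of all worlds that contain the given worker."""
        return {name for name, members in self.worlds.items() if worker in members}

    def handle_failure(self, worker: str) -> set[str]:
        """Remove every world containing the failed worker; return their names."""
        broken = self.worlds_of(worker)
        for name in broken:
            del self.worlds[name]
        return broken

    def alive_workers(self) -> set[str]:
        """Workers that still belong to at least one world."""
        return set().union(*self.worlds.values())


# Assumed layout for (a): stage 1 = P1, replicated stage 2 = P2 and P3, stage 3 = P4.
registry = WorldRegistry()
registry.add_world("W1", {"P1", "P2"})
registry.add_world("W2", {"P1", "P3"})
registry.add_world("W3", {"P2", "P4"})
registry.add_world("W4", {"P3", "P4"})

# (b) P3 fails: only the worlds containing P3 are removed.
removed = registry.handle_failure("P3")
survivors = registry.alive_workers()

# (c) Online instantiation: a replacement worker P5 joins through new worlds.
registry.add_world("W5", {"P1", "P5"})
registry.add_world("W6", {"P5", "P4"})
```

In this sketch the path P1 → P2 → P4 stays intact after the failure because W1 and W3 never contained P3. The same `add_world` call serves both recovery and scale-out, which mirrors the caption's point that online instantiation covers both cases.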