Table of Contents
Fetching ...

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs

Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang

TL;DR

The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization.

Abstract

The recent progress made in large language models (LLMs) has brought tremendous application prospects to the world. The growing model size demands LLM training on multiple GPUs, while data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt the model-sharded data parallelism to enable memory-efficient training, however, existing model-sharded data-parallel systems fail to efficiently utilize GPU on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth due to 1) severe interference between collective operation and GPU computation and 2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve the network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to the network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of a 100B-scale model on distributed GPUs. Such new data-parallel paradigm keeps a similar communication pattern as model-sharded data parallelism but with a centralized in-network optimizer execution. The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. The experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training on a 175B model on an 8-worker cluster.

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs

TL;DR

The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization.

Abstract

The recent progress made in large language models (LLMs) has brought tremendous application prospects to the world. The growing model size demands LLM training on multiple GPUs, while data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt the model-sharded data parallelism to enable memory-efficient training, however, existing model-sharded data-parallel systems fail to efficiently utilize GPU on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth due to 1) severe interference between collective operation and GPU computation and 2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve the network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to the network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of a 100B-scale model on distributed GPUs. Such new data-parallel paradigm keeps a similar communication pattern as model-sharded data parallelism but with a centralized in-network optimizer execution. The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. The experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training on a 175B model on an 8-worker cluster.
Paper Structure (25 sections, 17 figures, 4 tables)

This paper contains 25 sections, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Comparison between different systems. Red blocks indicate the devices being a bottleneck in end-to-end training.
  • Figure 2: Intra-worker communication and time breakdown of a training iteration under ZeRO-Infinity. The time is measured when training GPT-3 175B with 8 distributed A100-40GB GPUs and the micro-batch size to 16. We observe that the GPU utilization is low. $N$ represents the number of workers and $S$ represents the total parameter size of the model.
  • Figure 3: Execution breakdown of transformer block computation and network communication in separate streams using different network solutions.
  • Figure 4: Breakdown of a training iteration in SwitchML.
  • Figure 5: Code example representing a worker's training iteration in LuWu and ZeRO-Infinity. LuWu keeps roughly the same programmability as ZeRO-Infinity.
  • ...and 12 more figures