Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Jinrui Zhang; Chaodong Xiao; Aoqi Wu; Xindong Zhang; Lei Zhang

TL;DR

This work tackles the resource barriers of pretraining large language models by introducing SPES, a memory-efficient decentralized framework for MoE LLMs. Each node trains only a subset of experts and shares knowledge through sparse synchronization, which significantly reduces memory and communication requirements while preserving performance. A key contribution is the expert-merging warm-up, which accelerates early training by enabling cross-node exchange of similar expert parameters. Empirical results show that SPES is competitive with, or matches, centralized baselines at the 2B, 7B, and 9B scales while using weakly connected GPUs over internet links, thereby broadening access to large-scale pretraining and enabling scalable, distributed collaboration.

Abstract

Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at https://github.com/zjr2000/SPES.
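
To make the synchronization idea in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that experts are split into disjoint, equally sized subsets, that the non-expert (shared) weights are trained on every node and averaged at a parameter server, and that each expert is simply copied from the single node that owns it. The names `assign_experts`, `local_payload`, and `synchronize`, and the flat-list parameter layout, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # toy stand-in for named weight tensors


def assign_experts(num_experts: int, num_nodes: int) -> List[List[int]]:
    """Split expert indices into disjoint, contiguous per-node subsets (assumed scheme)."""
    if num_experts % num_nodes != 0:
        raise ValueError("this sketch assumes num_experts is divisible by num_nodes")
    per_node = num_experts // num_nodes
    return [list(range(n * per_node, (n + 1) * per_node)) for n in range(num_nodes)]


def local_payload(node_params: Params, owned: List[int]) -> Params:
    """Select what a node uploads: shared weights plus only the experts it trained."""
    keep = {f"expert_{i}" for i in owned}
    return {k: v for k, v in node_params.items() if k == "shared" or k in keep}


def synchronize(payloads: List[Params]) -> Params:
    """Server side: average entries sent by several nodes; single-owner entries pass through."""
    buckets: Dict[str, List[List[float]]] = {}
    for payload in payloads:
        for name, vec in payload.items():
            buckets.setdefault(name, []).append(vec)
    return {
        name: [sum(col) / len(col) for col in zip(*vecs)]
        for name, vecs in buckets.items()
    }


# Two nodes, four experts: node 0 owns experts 0-1, node 1 owns experts 2-3.
owners = assign_experts(num_experts=4, num_nodes=2)
node0 = {"shared": [1.0, 2.0], "expert_0": [0.5], "expert_1": [0.1],
         "expert_2": [9.0], "expert_3": [9.0]}  # experts 2-3 are stale on node 0
node1 = {"shared": [3.0, 4.0], "expert_0": [9.0], "expert_1": [9.0],
         "expert_2": [7.0], "expert_3": [8.0]}  # experts 0-1 are stale on node 1

uploads = [local_payload(node0, owners[0]), local_payload(node1, owners[1])]
merged = synchronize(uploads)
# merged["shared"] is the element-wise mean [2.0, 3.0]; each expert keeps its owner's value.
```

In this toy setup each node uploads three of its five parameter groups, which mirrors the point that full-parameter transmission is avoided. The real system's ownership scheme, synchronization interval, and optimizer handling may differ from these assumptions.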

Paper Structure (34 sections, 25 equations, 6 figures, 9 tables)

Introduction
Related Work
Memory-Efficient Decentralized Pretraining
Preliminaries
Overall Framework
Sparse Expert Synchronization
Experiments
Experiments Setup
Main Results
Conclusion
Theoretical Analysis of SPES
Problem Setup and Notation
SPES Update Rule
Sparse synchronization.
Expert-merging warm-up.
...and 19 more sections

Figures (6)

Figure 1: Comparison of different pretraining paradigms for LLMs. Left: centralized training, which requires high-memory GPUs and high-bandwidth interconnects (e.g., RDMA) for its tightly coupled model or data parallelism. Middle: existing decentralized training (e.g., DiLoCo, Photon), where each node trains a full model locally, reducing bandwidth needs but still demanding high-memory GPUs. Right: our proposed SPES, a memory-efficient decentralized method for training MoE-based LLMs, where each node trains only a subset of experts, substantially reducing both per-GPU memory usage and communication overhead.
Figure 2: (a) Illustration of our model structure, in which we utilize an MoE LLM comprising standard self-attention blocks, normalization layers, and routed feed-forward modules. (b) Illustration of SPES, where each node performs local training on a disjoint subset of experts to reduce memory consumption. During weight synchronization, only the trained parameters are transmitted to the parameter server, minimizing communication overhead. To improve data utilization, we propose an expert-merging strategy that merges similar experts to facilitate knowledge sharing.
Figure 3: Memory and communication costs across training paradigms. Experiments are conducted with a batch size of 2 and a sequence length of 2048. For the 2B model, we employ PyTorch DDP. For the 7B model, we utilize FSDP across 8 GPUs.
Figure 4: Performance comparison across different training paradigms. Performance during training is evaluated using the evaluation suite integrated into the open-source OLMo codebase.
Figure A1: Ablation on key hyper-parameters in expert merging. The reported average is computed over ARC(e), SciQ, PIQA, WinoGrande, ARC(c), OBQA, and SIQA.
...and 1 more figure
