
LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

Xingyu Xie, Zhijie Lin, Kim-Chuan Toh, Pan Zhou

TL;DR

LoCo tackles the critical bottleneck of gradient communication in large-scale model training by introducing a moving-average compensated, low-bit gradient adaptor that compresses gradients to $4$-bit while storing a compact $8$-bit compensation error. The method is optimizer-agnostic and integrates with FSDP via an all2all gradient aggregation, achieving convergence speeds matching full-precision baselines for SGD and Adam-family optimizers in nonconvex settings. Theoretical guarantees accompany extensive experiments showing $14\%$–$40\%$ training-speed improvements on LLMs like LLAMA2 and Mixtral, with modest memory overhead and strong compatibility with Megatron-LM and FSDP. Overall, LoCo provides a scalable, convergent, and practical solution for efficient large-model training under low-bit communication regimes.
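To put the bit widths above in perspective, the following back-of-the-envelope sketch compares the per-step gradient payload at 16-bit and 4-bit precision, along with the size of an 8-bit error buffer. The parameter count is a hypothetical example, not a figure from the paper, and the calculation ignores quantization metadata such as per-block scales.

```python
def payload_bytes(num_params: int, bits: int) -> int:
    """Bytes needed to hold one value per parameter at `bits` bits each."""
    return (num_params * bits + 7) // 8


# Hypothetical model size, chosen only for illustration.
num_params = 7_000_000_000

full_precision = payload_bytes(num_params, 16)  # 16-bit gradient payload
low_bit = payload_bytes(num_params, 4)          # 4-bit gradient payload
error_buffer = payload_bytes(num_params, 8)     # 8-bit compensation error (kept locally)

print(f"16-bit payload: {full_precision / 1e9:.1f} GB")
print(f"4-bit payload:  {low_bit / 1e9:.1f} GB")
print(f"reduction:      {full_precision / low_bit:.1f}x")
print(f"error buffer:   {error_buffer / 1e9:.1f} GB")
```

The payload shrinks by a factor of four, while the reported end-to-end speedup is $14\%$–$40\%$; the gap is expected, since communication is only one part of each training step.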

Abstract

To train large-scale models efficiently, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones, making gradient synchronization among GPU nodes more efficient. However, the information lost in compression often degrades training quality. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo maintains a moving average of historical compensation errors to stably estimate the current compression error, and then uses this estimate to compensate the current gradient before compression, yielding a less lossy compression. This mechanism makes LoCo compatible with general optimizers like Adam and with sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on nonconvex problems. Experimental results show that, across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam's training speed by 14% to 40% without performance degradation on large language models such as LLAMAs and MoE models.
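The compensation idea described in the abstract can be sketched as a generic error-feedback loop. This is a minimal illustration under stated assumptions, not the paper's algorithm: the symmetric uniform quantizer, the mixing weight `beta`, and the class name `CompensatedCompressor` are all hypothetical, and the error buffer is kept in floating point here rather than in 8-bit.

```python
def quantize(values, bits=4):
    """Symmetric uniform quantization to signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


class CompensatedCompressor:
    """Hypothetical sketch: compensate with a moving average of past errors."""

    def __init__(self, size, beta=0.5, bits=4):
        self.error = [0.0] * size  # running estimate of the compression error
        self.beta = beta           # assumed moving-average weight
        self.bits = bits

    def compress(self, grad):
        # 1. Add the estimated error to the gradient before compressing.
        compensated = [g + e for g, e in zip(grad, self.error)]
        # 2. Compress to low-bit codes; this is what would be communicated.
        codes, scale = quantize(compensated, self.bits)
        # 3. Measure what was lost and fold it into the moving average.
        restored = dequantize(codes, scale)
        lost = [c - r for c, r in zip(compensated, restored)]
        self.error = [
            (1 - self.beta) * e + self.beta * n
            for e, n in zip(self.error, lost)
        ]
        return codes, scale


compressor = CompensatedCompressor(size=3)
codes, scale = compressor.compress([0.1, -0.7, 0.33])
print(codes, scale, compressor.error)
```

Each call returns the low-bit codes and a scale, while the error estimate stays on the local node and influences only the next compression step.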

Paper Structure

This paper contains 45 sections, 8 theorems, 75 equations, 4 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

Suppose that the assumptions on $L$-smoothness, bounded variance, and $p$-bit compression hold. Let $s_e = \Omega(\epsilon^{-4})$ and $\eta = \mathcal{O}(\epsilon^2)$ in LoCo-integrated SGD. Then, after $T = \Omega(\epsilon^{-4})$ iterations, LoCo-integrated SGD finds an $\epsilon$-accurate first-order stationary point; that is, its stochastic gradient complexity is $\mathcal{O}(\epsilon^{-4})$.
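To see what the $\epsilon^{-4}$ rate implies in practice, consider an iteration budget of the form $T(\epsilon) = C\,\epsilon^{-4}$, where $C$ is a hypothetical problem-dependent constant introduced here only for illustration. Halving the target accuracy then multiplies the required number of iterations by 16:

```latex
T(\epsilon) = C\,\epsilon^{-4}
\quad\Longrightarrow\quad
\frac{T(\epsilon/2)}{T(\epsilon)}
  = \frac{C\,(\epsilon/2)^{-4}}{C\,\epsilon^{-4}}
  = 2^{4}
  = 16 .
```

The same scaling applies to $s_e$, which the theorem also sets to $\Omega(\epsilon^{-4})$.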

Figures (4)

  • Figure 1: Illustration of LoCo. At iteration $k$, before compression, LoCo compensates the full-precision gradient with the compression error from previous iterations to reduce the compression error. Then, it compresses the gradient into a low-bit one for fast communication among GPU nodes.
  • Figure 2: Loss curves of various low-bit optimization methods. (a) Training results on GPT2-345M with 52B tokens from the OpenWebtext dataset, showing that 4-bit LoCo achieves performance comparable to 16-bit Adam. (b) Training results on LLaMA2-0.8B with 30B tokens from RedPajama-v2, where LoCo-Zero++ achieves training quality on par with 16-bit AdamW and outperforms Zero++. (c) Training results on LLaMA2-8B with 5B tokens, illustrating the effectiveness of LoCo-Zero++ in maintaining high training quality even for larger model sizes.
  • Figure 3: Overview of All-Reduce and its component operations. The All-Reduce process is depicted in two main phases. First, the Reduce-Scatter operation is performed, in which gradients are divided into equal blocks and summed across GPUs according to their ranks. This is followed by the All-Gather phase, in which each GPU shares its segment of the aggregated gradients, so that the complete set of gradients is available to all GPUs. The figure also depicts the Ring reduce-scatter and Alltoall operations, which are integral to gradient distribution and aggregation across the cluster.
  • Figure 4: Overview of sharding strategy. (1) Initial Framework without Sharding, where each GPU stores the entire model, leading to high memory usage but simplified communication; (2) Sharded Model, introducing a strategy where GPUs maintain only local partitions of optimizer states and averaged gradients, enhancing memory efficiency through reduce-scatter and all-gather operations for gradient and weight management; (3) Fully Sharded Data Parallelism, further optimizing memory by limiting each GPU to a partition of the model weights, necessitating inter-GPU collection of weight partitions for model assembly, significantly reducing memory requirements and enabling the training of large-scale models previously constrained by memory limitations.
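The two-phase All-Reduce described in Figure 3 can be simulated on plain Python lists. This is a toy, single-process sketch: the function names are illustrative rather than any library's API, each "rank" is just a list, and the gradient length is assumed to divide evenly by the number of ranks.

```python
def reduce_scatter(grads):
    """grads[r] is rank r's full gradient; rank r gets the summed r-th block."""
    world = len(grads)
    block = len(grads[0]) // world  # assumes the length divides evenly
    return [
        [sum(g[i] for g in grads) for i in range(r * block, (r + 1) * block)]
        for r in range(world)
    ]


def all_gather(shards):
    """Every rank receives the concatenation of all ranks' blocks."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(grads):
    """All-Reduce expressed as Reduce-Scatter followed by All-Gather."""
    return all_gather(reduce_scatter(grads))


# Two simulated ranks, each holding a local gradient of length 4.
grads = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]
shards = reduce_scatter(grads)  # each rank holds one summed block
result = all_reduce(grads)      # every rank holds the full summed gradient
print(shards)
print(result)
```

In the sharded setups of Figure 4, each GPU keeps only its own block after the reduce-scatter step, which is where the memory saving on gradients comes from.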

Theorems & Definitions (16)

  • Theorem 1: SGD Convergence
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • proof
  • proof
  • Lemma 5
  • proof
  • ...and 6 more