Communication Efficient Distributed Training with Distributed Lion

Bo Liu, Lemeng Wu, Lizhang Chen, Kaizhao Liang, Jiaxu Zhu, Chen Liang, Raghuraman Krishnamoorthi, Qiang Liu

TL;DR

This paper introduces Distributed Lion, a communication-efficient distributed training framework built on the Lion optimizer. Each worker maintains its own optimizer state and transmits only a binary update $\delta_{i,t}$ to a central server, which aggregates the updates into $\Delta_t$ by averaging or majority voting and broadcasts the result back for the parameter update. This sharply reduces bandwidth compared to communicating FP32 gradients. The authors formalize the method as a box-constrained optimization problem with two-phase dynamics and provide convergence guarantees for both aggregation schemes, supported by empirical results on CIFAR-10, ImageNet, and large-language-model tasks. Across vision and language benchmarks, Distributed Lion achieves performance comparable to global Lion or AdamW while reducing communication by 30x–32x, and it often outperforms existing efficient distributed methods such as DGC and TernGrad in the bandwidth-accuracy trade-off. The work points to a promising direction for scalable, resource-efficient distributed training of large AI models.
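The worker/server loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the standard Lion update rule (sign of an interpolation between momentum and gradient, with decoupled weight decay), and the function names `worker_update`, `aggregate`, and `apply_update` as well as the default hyperparameters are hypothetical.

```python
import numpy as np

def worker_update(m, grad, beta1=0.9, beta2=0.99):
    """One local Lion step on a worker (assumed standard Lion rule).

    Returns the sign update delta (entries in {-1, 0, +1}; 0 only on
    exact ties) and the worker's new momentum.
    """
    delta = np.sign(beta1 * m + (1.0 - beta1) * grad)
    new_m = beta2 * m + (1.0 - beta2) * grad
    return delta, new_m

def aggregate(deltas, mode="majority"):
    """Server-side aggregation of the workers' sign updates."""
    total = np.sum(deltas, axis=0)
    if mode == "majority":
        return np.sign(total)          # elementwise majority vote
    if mode == "average":
        return total / len(deltas)     # mean of the sign vectors
    raise ValueError(f"unknown mode: {mode}")

def apply_update(x, Delta, lr=1e-3, weight_decay=0.1):
    """Apply the aggregated update plus weight decay to the parameters."""
    return x - lr * (Delta + weight_decay * x)

# Toy example: three workers, three parameters.
deltas = np.array([[1, -1, 1],
                   [1, 1, -1],
                   [-1, -1, 1]], dtype=float)
vote = aggregate(deltas, mode="majority")
mean = aggregate(deltas, mode="average")
```

Note that only `delta` leaves the worker and only the aggregated `Delta` comes back; the momentum `m` stays local, which is what keeps the communicated payload low-precision.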

Abstract

The Lion optimizer has been a promising competitor to AdamW for training large AI models, with advantages in memory, computation, and sample efficiency. In this paper, we introduce Distributed Lion, an innovative adaptation of Lion for distributed training environments. Leveraging the sign operator in Lion, Distributed Lion only requires communicating binary or lower-precision vectors between the workers and the central server, significantly reducing the communication cost. Our theoretical analysis confirms Distributed Lion's convergence properties. Empirical results demonstrate its robustness across a range of tasks, worker counts, and batch sizes, on both vision and language problems. Notably, Distributed Lion attains performance comparable to standard Lion or AdamW optimizers applied to aggregated gradients, but with significantly reduced communication bandwidth. This feature is particularly advantageous for training large models. We also demonstrate that Distributed Lion presents a more favorable performance-bandwidth balance than existing efficient distributed methods such as deep gradient compression and ternary gradients.
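To make the "binary vectors" point concrete, the sketch below packs a sign vector into one bit per parameter before sending and restores it on the other side. It is an illustration under stated assumptions rather than the paper's wire format: the helpers `pack_signs` and `unpack_signs` are hypothetical, and zeros are mapped to -1 so that every entry fits in a single bit.

```python
import numpy as np

def pack_signs(delta):
    """Pack a {-1, +1} vector into bits (+1 -> 1, everything else -> 0)."""
    bits = (np.asarray(delta) > 0).astype(np.uint8)
    return np.packbits(bits), bits.size

def unpack_signs(packed, n):
    """Recover the {-1, +1} vector of length n from its packed bits."""
    bits = np.unpackbits(packed, count=n)
    return bits.astype(np.int8) * 2 - 1

rng = np.random.default_rng(0)
delta = np.where(rng.standard_normal(1024) > 0, 1, -1).astype(np.int8)

packed, n = pack_signs(delta)
restored = unpack_signs(packed, n)

fp32_bytes = delta.astype(np.float32).nbytes   # payload if sent as FP32
packed_bytes = packed.nbytes                   # payload if sent as bits
```

For 1024 parameters the FP32 payload is 4096 bytes and the packed payload is 128 bytes, which is the 32x gap between 32 bits and 1 bit per parameter.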

Paper Structure (29 sections, 16 theorems, 49 equations, 4 figures, 4 tables)


Key Result

Theorem 4.4

Assume $f\colon \mathbb{R}^d\to \mathbb{R}$ is $L$-smooth, $\beta_1,\beta_2 \in (0,1)$ with $\beta_2>\beta_1$, and $\epsilon, \lambda > 0$. Let $(x_t)_{t\geq 0}$ be generated by Algorithm \ref{alg:dist-lion}. Define $\mathcal{F} =\{x \colon \left\lVert\lambda x\right\rVert_\infty \leq 1 \}$, and …

Figures (4)

  • Figure 1: Illustration of Distributed Lion. Each worker keeps its own optimizer state and applies the Lion optimizer individually to produce a binary update $\delta_{i,t} =\texttt{Lion}(x, \mathcal{D}_i)$ (without the weight decay). The server then aggregates all $\delta_{i,t}$ to produce a binary $\Delta_t$ by majority vote (or an integer $\Delta_t$ by averaging) and sends it back to all workers. The workers then apply $\Delta_t$ and weight decay to update their model parameters. See Algorithm \ref{alg:dist-lion} for details.
  • Figure 2: Performance of Distributed Lion vs. other efficient distributed optimizers on CIFAR-10 with 4, 8, 16, and 32 workers. At each iteration, each worker runs on a local batch of size 32. All results are averaged over three seeds.
  • Figure 3: Performance of different methods vs. $k$.
  • Figure 4: Test Error vs. Communication Bits per Iteration (closer to the lower-left is better). Note that G-Lion and G-AdamW are both set to 64, because they require 32 bits per parameter in each of the worker-to-server and server-to-worker directions.
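The accounting behind the Figure 4 note can be written out as a short calculation. This is a sketch under assumptions: the helper `bits_per_param` is hypothetical, and the majority-vote figure assumes exactly one bit per parameter in each direction (ties ignored). The averaging variant is left out because its server-to-worker message is an integer whose bit width depends on the number of workers.

```python
def bits_per_param(uplink_bits, downlink_bits):
    """Bits communicated per parameter per iteration, both directions."""
    return uplink_bits + downlink_bits

# Global Lion / AdamW: FP32 gradients up, FP32 result down.
global_fp32 = bits_per_param(32, 32)

# Distributed Lion with majority vote (assumed 1 bit each way).
majority_vote = bits_per_param(1, 1)

reduction = global_fp32 / majority_vote
```

Under these assumptions the baselines sit at 64 bits per parameter and majority-vote Distributed Lion at 2, a 32x reduction.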

Theorems & Definitions (27)

  • Theorem 4.4: Phase I
  • Proposition 4.5
  • Theorem 4.6: Majority Vote
  • Theorem 4.7: Global
  • Theorem 4.8: Averaging
  • Theorem 1.1: Phase I
  • proof
  • Proposition 1.5
  • proof
  • Theorem 1.6: Convergence in Phase II
  • ...and 17 more