Stable-MoE: Lyapunov-based Token Routing for Distributed Mixture-of-Experts Training over Edge Networks

Long Shi, Bingyan Ou, Kang Wei, Weihao Zhu, Zhe Wang, Zhiyong Chen

TL;DR

The paper tackles throughput bottlenecks in distributed Mixture-of-Experts (MoE) training over resource-heterogeneous edge networks with stochastic token arrivals. It introduces Stable-MoE, a Lyapunov-optimization framework that jointly optimizes token routing and computation frequency to maximize system throughput and gating consistency while keeping the token and energy queues stable in the long term. The long-term stochastic problem is converted into per-slot subproblems, so decisions can be made online without knowledge of future system states; each subproblem is handled by an optimization solver. Experiments on SVHN and CIFAR-100 show gains of at least 40% in system throughput and 5% in test accuracy over multiple baselines and confirm queue stability, supporting the practical viability of Lyapunov-based routing for edge MoE systems.

Abstract

The sparse activation mechanism of the mixture-of-experts (MoE) model empowers edge intelligence with enhanced training efficiency and reduced computational resource consumption. However, traditional token routing in distributed MoE training faces significant challenges in resource-constrained edge networks characterized by heterogeneous computing capabilities and stochastic token arrivals, which inevitably suffer from workload backlog, resource inefficiency, and performance degradation. To address this issue, we propose a novel Lyapunov-based token routing framework for distributed MoE training over resource-heterogeneous edge networks, termed Stable-MoE. Specifically, we formulate a stochastic optimization problem that maximizes both system throughput and gating consistency by optimizing the token routing strategy and computational resource allocation, while ensuring long-term stability of both the token and energy queues at the edge devices. Using Lyapunov optimization, we transform the intractable long-term optimization problem into tractable per-slot subproblems, enabling online decisions on token routing and computation frequency without knowledge of future system states. Experimental results on the SVHN and CIFAR-100 datasets demonstrate that Stable-MoE outperforms the baselines with gains of at least 40% in system throughput and 5% in test accuracy.
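
The per-slot, queue-aware routing idea in the abstract can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not the paper's algorithm: it uses top-1 routing, models token queues only (no energy queue and no computation-frequency control), fixes the per-server service rates, and draws Poisson token arrivals with random gate scores. All names (`route_tokens`, `update_queues`, `simulate`, `V`, `service_rates`) are hypothetical.

```python
import numpy as np


def route_tokens(gate_scores, backlog, V):
    """Pick one expert per token by a drift-plus-penalty rule.

    With backlogs fixed at the start of the slot, minimizing
    sum_i Q_i * a_i - V * (sum of chosen gate scores) separates per token:
    each token goes to the expert maximizing V * g[n, i] - Q_i.
    """
    weights = V * gate_scores - backlog[None, :]
    return np.argmax(weights, axis=1)


def update_queues(backlog, arrivals, service):
    """Standard queue recursion: Q(t+1) = max(Q(t) - b(t), 0) + a(t)."""
    return np.maximum(backlog - service, 0.0) + arrivals


def simulate(num_slots=200, V=5.0, service_rates=(2.0, 4.0, 6.0, 8.0),
             mean_tokens=12, seed=0):
    """Run the routing rule over heterogeneous servers (assumed toy setup)."""
    rng = np.random.default_rng(seed)
    service = np.asarray(service_rates, dtype=float)  # tokens served per slot
    num_experts = service.size
    backlog = np.zeros(num_experts)
    history = []
    for _ in range(num_slots):
        n_tokens = rng.poisson(mean_tokens)  # stochastic token arrivals
        # Random stand-in for the gating network's softmax output.
        gate_scores = rng.dirichlet(np.ones(num_experts), size=n_tokens)
        choice = route_tokens(gate_scores, backlog, V)
        arrivals = np.bincount(choice, minlength=num_experts).astype(float)
        backlog = update_queues(backlog, arrivals, service)
        history.append(backlog.copy())
    return np.array(history)  # shape: (num_slots, num_experts)


if __name__ == "__main__":
    hist = simulate()
    print("time-averaged backlog per server:", hist.mean(axis=0))
```

In this toy rule, a larger `V` makes the router follow the gate scores more closely at the cost of longer queues, while a smaller `V` favors servers with short backlogs.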

Paper Structure

This paper contains 7 sections, 1 theorem, 13 equations, 4 figures, 1 algorithm.

Key Result

Lemma 1

For any queue backlogs and actions, $\Delta_V(t)$ is upper bounded by [equation missing in source], where [definitions missing in source].
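
For orientation only, bounds of this kind in Lyapunov optimization usually take the textbook drift-plus-penalty form sketched here. This is not the paper's Lemma 1: it assumes $\Delta_V(t)$ denotes the drift-plus-penalty expression, covers a single generic family of queues rather than the paper's token and energy queues, and every symbol other than $\Delta_V(t)$ ($Q_i$, $a_i$, $b_i$, $U$, $B$, $V$, $\Theta$) is an assumed placeholder.

```latex
% Assumed generic setup (not the paper's notation):
%   queue update      Q_i(t+1) = \max\{Q_i(t) - b_i(t),\, 0\} + a_i(t)
%   Lyapunov function L(\Theta(t)) = \tfrac{1}{2} \sum_i Q_i(t)^2
%   utility to maximize U(t), trade-off weight V \ge 0
\begin{align}
  Q_i(t+1)^2 &\le Q_i(t)^2 + a_i(t)^2 + b_i(t)^2
               + 2\, Q_i(t)\,\bigl(a_i(t) - b_i(t)\bigr), \\
  \Delta_V(t) &= \mathbb{E}\bigl[L(\Theta(t+1)) - L(\Theta(t)) \mid \Theta(t)\bigr]
               - V\, \mathbb{E}\bigl[U(t) \mid \Theta(t)\bigr] \\
              &\le B + \sum_i Q_i(t)\,
                 \mathbb{E}\bigl[a_i(t) - b_i(t) \mid \Theta(t)\bigr]
               - V\, \mathbb{E}\bigl[U(t) \mid \Theta(t)\bigr],
\end{align}
% where B is any constant with
%   B \ge \tfrac{1}{2} \sum_i \mathbb{E}\bigl[a_i(t)^2 + b_i(t)^2 \mid \Theta(t)\bigr].
```

Minimizing the right-hand side in each slot, given only the current backlogs, is what turns the long-term problem into online per-slot subproblems.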

Figures (4)

  • Figure 1: The distributed MoE training over edge networks with a router and distributed edge servers.
  • Figure 2: Queue backlogs of Stable-MoE versus $\mathcal{T}$. The blue line represents the instantaneous backlog observed in each round, whereas the red dashed lines indicate the global mean values obtained by averaging over all rounds.
  • Figure 3: Throughput comparison between Stable-MoE and Strategies A-D.
  • Figure 4: Accuracy comparison between Stable-MoE and Strategies A-D.

Theorems & Definitions (2)

  • Lemma 1
  • proof