Theory of Mixture-of-Experts for Mobile Edge Computing

Hongbo Li, Lingjie Duan

TL;DR

The paper tackles continual learning in mobile edge computing (MEC) where streaming tasks with unknown data distributions arrive at edge servers. It proposes a mixture-of-experts (MoE) framework, treating each MEC server as an expert and using an adaptive gating network to route tasks to specialized idle experts under server availability constraints, with switch routing and a locality loss to promote specialization. The authors establish a lower bound on the number of experts $M_{th}$ to ensure convergence, prove router convergence after exploration time, and show the overall generalization error can be bounded by a small constant $O(\sigma_0^2)$ as the horizon grows, with detailed analysis of how expert count affects convergence. They validate the theory with real-data experiments on linear models and DNNs (MNIST), demonstrating that MoE reduces forgetting and improves generalization over time compared to traditional MEC offloading strategies, especially when $M$ meets the derived requirements.

Abstract

In mobile edge computing (MEC) networks, mobile users generate diverse machine learning tasks dynamically over time. These tasks are typically offloaded to the nearest available edge server, based on communication and computational efficiency. However, this practice does not ensure that each server specializes in a specific type of task, and it leads to severe overfitting or catastrophic forgetting of previous tasks. To improve the continual learning (CL) performance on online tasks, we are the first to introduce mixture-of-experts (MoE) theory in MEC networks and to prevent the generalization error of MEC operation from growing over time. Our MoE theory treats each MEC server as an expert and dynamically adapts to changes in server availability by considering data transfer and computation time. Unlike existing MoE models designed for offline tasks, ours is tailored to continuous streams of tasks in the MEC environment. We introduce an adaptive gating network in MEC to identify and route newly arrived tasks of unknown data distributions to available experts, enabling each expert to specialize in a specific type of task upon convergence. We derive the minimum number of experts required to match each task with a specialized, available expert. Our MoE approach consistently reduces the overall generalization error over time, unlike the traditional MEC approach. Interestingly, when the number of experts is already sufficient to ensure convergence, adding more experts delays the convergence time and worsens the generalization error. Finally, we perform extensive experiments on real datasets with deep neural networks (DNNs) to verify our theoretical results.

Paper Structure

This paper contains 17 sections, 14 theorems, 68 equations, 4 figures, 1 algorithm.

Key Result

Lemma 1

For the selected expert $m_t$, after completing task $t$ at time $t+d_t$, its expert model is updated to be (gunasekar2018characterizing, evron2022catastrophic, lin2023theory): While for any other expert $m\neq m_t$, we keep its model unchanged at time $t+d_t$, i.e.,

Figures (4)

  • Figure 1: An illustration of MEC networks with $M$ edge servers as experts. At the beginning of time $t$, a mobile user arrives to request a task-training service from its nearest Base Station (BS) of expert $\tilde{m}_t$ (e.g., $\tilde{m}_t=1$ in this case). Then our adaptive MoE gating network in Fig. 2 selects one idle expert $m_t$ (e.g., $m_t=2$ in this case) out of $M$ experts and asks the BS of expert $\tilde{m}_t$ to forward the task dataset to the BS of the chosen expert $m_t$. After completing the task learning, the selected expert $m_t$ updates its local model and transmits the training result back to the mobile user via expert $\tilde{m}_t$'s BS. Finally, the MoE updates the gating network for subsequent task use (see Fig. 2).
  • Figure 2: The MoE structure of the MEC network operator in Fig. 1, which contains a gating network and a router. After a mobile user arrives and uploads its dataset $\mathcal{D}_t$ to the MEC network operator (step 1), the gating network computes its linear output $\mathbf{h}(\mathbf{X}_t,\mathbf{\Theta}_t)$ by (\ref{h_X_theta}) based on the input dataset $\mathcal{D}_t$ (step 2). Then, the router selects the best expert $m_t$ for training task $t$ by the adaptive strategy (\ref{m_t}), based on the gating output $\mathbf{h}(\mathbf{X}_t,\mathbf{\Theta}_t)$ (step 3). After completing the data training, expert $m_t$ updates its local model and outputs its learning result back to the mobile user (step 4). Finally, the MEC network operator updates the gating network parameter $\mathbf{\Theta}_{t+d_t}$ based on the learning result and the softmaxed value $\bm{\pi}(\mathbf{X}_t,\mathbf{\Theta}_t)$ derived in (\ref{softmax}) (step 5).
  • Figure 3: The dynamics of overall generalization errors under our MoE Algorithm 1 and the existing MEC offloading strategies that always select the nearest or the most powerful available server (e.g., ouyang2018follow, shakarami2020survey, gao2019winning, yan2021pricing). Here we set $T=3000$, $N=10$, $\sigma_0=0.6$, $d_u=10$, $\eta=0.2$, $p=15$, $s=10$, and vary $M\in\{10,30,50,70\}$.
  • Figure 4: The dynamics of overall generalization errors under our MoE Algorithm 1 and the existing MEC offloading strategies, using DNNs on the MNIST dataset (lecun1989handwritten).
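
Steps 2 and 3 of the Figure 2 caption (linear gating output, softmax, routing to an idle expert) can be illustrated with a minimal Python sketch. This is not the authors' Algorithm 1. The gating form used here (each expert's score is the inner product of its parameter column with the sum of the feature columns), the plain argmax over idle experts, and the names `gating_output`, `softmax`, and `route` are all assumptions made for illustration; the paper's own gating and routing equations are not reproduced in this summary.

```python
import math

def gating_output(X, Theta):
    """Assumed linear gating: one score per expert.

    X is a p-by-s feature matrix (list of rows), Theta is a p-by-M
    parameter matrix. Expert m's score is the inner product of column m
    of Theta with the row sums of X (an illustrative choice).
    """
    feature_sum = [sum(row) for row in X]
    num_experts = len(Theta[0])
    return [
        sum(Theta[i][m] * feature_sum[i] for i in range(len(feature_sum)))
        for m in range(num_experts)
    ]

def softmax(h):
    """Numerically stable softmax over the gating scores."""
    peak = max(h)
    exps = [math.exp(v - peak) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def route(h, idle):
    """Pick the idle expert with the largest gating score (assumed rule)."""
    candidates = [m for m, is_idle in enumerate(idle) if is_idle]
    if not candidates:
        raise ValueError("no idle expert available")
    return max(candidates, key=lambda m: h[m])

# Toy example: p = 2 features, s = 2 samples, M = 3 experts.
X = [[1.0, 0.0], [0.0, 1.0]]
Theta = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0]]
h = gating_output(X, Theta)               # scores for experts 0, 1, 2
pi = softmax(h)                           # softmaxed values used when updating the gate
m_t = route(h, idle=[True, False, True])  # expert 1 is busy, so it is skipped
```

In the toy example the highest-scoring expert is busy, so the task goes to the best remaining idle expert. This is the server-availability constraint that the caption's router has to respect.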

Theorems & Definitions (23)

  • Definition 1
  • Lemma 1
  • Definition 2
  • Lemma 2
  • Proposition 1
  • Proposition 2: Router's Convergence
  • Proposition 3: Experts' Learning Convergence
  • Proposition 4
  • Theorem 1
  • Lemma 3
  • ...and 13 more