CoSMoEs: Compact Sparse Mixture of Experts

Patrick Huber, Akshat Shrivastava, Ernie Chang, Chinnadhurai Sankar, Ahmed Aly, Adithya Sagar

TL;DR

CoSMoEs tackles the challenge of deploying sparse Mixture-of-Experts (MoE) models on edge devices by introducing weight-decomposed (WD) experts and a block-wise expert selection (BlES) loss that reduces expert offloads. In a fair comparison against FLOP-aligned dense baselines, the approach yields a clear quality advantage (at least 2.35% absolute, with WD adding up to 1.1%), while BlES substantially improves memory efficiency and generation latency (6x fewer expert replacements and ~50% faster generation). Training-time innovations deliver 5–10x efficiency versus dense models, enabling on-device deployment at phone and wearable scales. Overall, CoSMoEs demonstrates how compact MoEs can achieve high-quality, privacy-preserving on-device inference with practical memory/latency trade-offs.

Abstract

Sparse Mixture of Experts (MoE) models are popular foundational architectures at large scale but remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: Quality, Memory and Latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors), MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, which further improve MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.
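
For readers new to sparse MoE layers, the following is a minimal sketch of token-choice routing with k=2, the setting named in the caption of Figure 2. It is an illustration under stated assumptions, not the authors' implementation: the function names (`route_token`, `moe_layer`), the toy experts, and the choice to renormalize the softmax weights over the selected experts are all hypothetical and chosen for clarity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs. Weights are softmax
    probabilities renormalized over the chosen experts, which is one
    common convention and an assumption of this sketch.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, k):
        y = experts[idx](token)  # only the k chosen experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy experts (hypothetical): each scales the input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
chosen = route_token(logits, k=2)
output = moe_layer([1.0, 1.0], logits, experts, k=2)
```

Only the selected experts are evaluated per token, which is why compute stays close to that of a smaller dense model while the total parameter count grows. That same property is what makes memory and offloading the central concerns on-device, since every expert must still be available when the router picks it.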

Paper Structure

This paper contains 29 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Server-side training environment (left) compared to the memory-constrained inference environment (right), showing deployment restrictions for parameter-heavy MoEs and large dense models on edge devices.
  • Figure 2: Sparse Mixture-of-Experts architecture with Token Choice (TC) Routing and k=2
  • Figure 3: Feed Forward Layer: Standard (left) and Weight-Decomposed (right).
  • Figure 4: Example expert selection (for simplicity, k=1) for individual layers and the complete model.
  • Figure 5: Example expert replacements. 1 = Active Expert, 0 = Inactive Expert. Top: BlES, Bottom: MoE.
  • ...and 3 more figures