CoSMoEs: Compact Sparse Mixture of Experts
Patrick Huber, Akshat Shrivastava, Ernie Chang, Chinnadhurai Sankar, Ahmed Aly, Adithya Sagar
TL;DR
CoSMoEs tackles the challenge of deploying sparse Mixture-of-Experts (MoE) models on edge devices by introducing weight-decomposed (WD) experts and a block-wise expert selection (BlES) loss that reduces expert offloads. The approach yields a clear quality advantage over FLOP-aligned dense baselines (at least 2.35% absolute, with WD adding up to 1.1%), while BlES improves memory efficiency and generation latency (6x fewer expert replacements and ~50% faster generation). Training-time innovations deliver 5–10x efficiency versus dense models, enabling on-device deployment at phone and wearable scales. Overall, CoSMoEs demonstrates that compact MoEs can achieve high-quality, privacy-preserving on-device inference with practical memory/latency trade-offs.
Abstract
Sparse Mixture-of-Experts (MoE) models are popular foundational architectures at large scale but remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: Quality, Memory, and Latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors), MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, which further improve MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.
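To make the offloading claim concrete, the following is a minimal sketch of why routing consecutive tokens to the same expert reduces expert replacements when only a few experts fit in device memory. It is an illustration under stated assumptions, not the method described here: the LRU eviction policy, the cache size, the function name `count_expert_replacements`, and both toy routing traces are hypothetical, and the sketch models only the effect of block-wise consistent routing, not the loss that encourages it.

```python
from collections import OrderedDict


def count_expert_replacements(routing, cache_size):
    """Count evictions from a fixed-size expert cache (hypothetical LRU policy).

    routing:    sequence of expert ids, one per token (top-1 routing assumed)
    cache_size: number of experts that fit in device memory at once (assumed)
    """
    cache = OrderedDict()
    replacements = 0
    for expert_id in routing:
        if expert_id in cache:
            # Expert already resident: mark it as most recently used.
            cache.move_to_end(expert_id)
            continue
        if len(cache) >= cache_size:
            # Cache is full: evict the least recently used expert.
            cache.popitem(last=False)
            replacements += 1
        cache[expert_id] = True
    return replacements


# Toy traces over 4 experts and 16 tokens (both are made-up examples).
scattered = [t % 4 for t in range(16)]   # expert changes on every token
blocked = [t // 4 for t in range(16)]    # same expert for blocks of 4 tokens

print(count_expert_replacements(scattered, cache_size=2))
print(count_expert_replacements(blocked, cache_size=2))
```

In this toy setting the blocked trace evicts an expert only when a block ends, whereas the scattered trace evicts one on nearly every token. Real savings depend on the actual router, cache policy, and memory budget.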
