CryptoMoE: Privacy-Preserving and Scalable Mixture of Experts Inference via Balanced Expert Routing
Yifan Zhou, Tianshi Xu, Jue Hong, Ye Wu, Meng Li
TL;DR
CryptoMoE tackles privacy in MoE-based LLM inference by enforcing Inference-Time Balanced Expert Routing, which caps each expert at $t$ tokens to conceal routing decisions. It couples a Confidence-Aware Secure Dispatch protocol with a lightweight Secure Combine protocol and a Batch Ciphertext-Plaintext MatMul to protect routing information while accelerating computation. Across DeepSeekMoE, QWenMoE, and OLMoE, it achieves end-to-end latency reductions of $2.8\sim3.5\times$ and communication reductions of $2.9\sim4.3\times$ with around $99.2\%$ of the original accuracy, and it complements CipherPrune pruning strategies for MoE. The work provides a practical, open-source framework for private MoE inference and demonstrates substantial gains over dense baselines, enabling privacy-preserving deployment of large MoE models.
Abstract
Private large language model (LLM) inference based on cryptographic primitives offers a promising path towards privacy-preserving deep learning. However, existing frameworks only support dense LLMs like LLaMA-1 and struggle to scale to mixture-of-experts (MoE) architectures. The key challenge lies in securely evaluating the dynamic routing mechanism in MoE layers, which may reveal sensitive input information if not fully protected. In this paper, we propose CryptoMoE, the first framework that enables private, efficient, and accurate inference for MoE-based models. CryptoMoE balances expert loads to protect expert routing information and proposes novel protocols for secure expert dispatch and combine. CryptoMoE also develops a confidence-aware token selection strategy and a batch matrix multiplication protocol to further improve accuracy and efficiency. Extensive experiments on DeepSeekMoE-16.4B, OLMoE-6.9B, and QWenMoE-14.3B show that CryptoMoE achieves $2.8\sim3.5\times$ end-to-end latency reduction and $2.9\sim4.3\times$ communication reduction over a dense baseline with minimal accuracy loss. We also adapt CipherPrune (ICLR'25) for MoE inference and demonstrate that CryptoMoE can reduce communication by up to $4.3\times$. Code is available at: https://github.com/PKU-SEC-Lab/CryptoMoE.
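To make the balanced-routing idea concrete, the following is a minimal plaintext sketch, not the paper's protocol. It assumes that each expert receives exactly $t$ slots, that the most confident tokens are kept when an expert is oversubscribed, and that spare slots are padded with dummy tokens. The function name `balanced_dispatch`, its parameters, and the use of `None` as a dummy token are hypothetical choices made for illustration; the real system would carry out the selection under cryptographic protection rather than in the clear.

```python
def balanced_dispatch(scores, top_k, t):
    """Plaintext illustration only: scores[i][e] is the router score of token i for expert e."""
    num_experts = len(scores[0])
    # Collect (confidence, token) candidates per expert from ordinary top-k routing.
    candidates = [[] for _ in range(num_experts)]
    for tok, row in enumerate(scores):
        chosen = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:top_k]
        for e in chosen:
            candidates[e].append((row[e], tok))
    slots = []
    for e in range(num_experts):
        # Keep the t most confident tokens; overflow tokens are dropped (assumption).
        ranked = sorted(candidates[e], key=lambda p: (-p[0], p[1]))
        kept = [tok for _, tok in ranked[:t]]
        # Pad with None (a dummy token) so every expert sees exactly t slots.
        slots.append(kept + [None] * (t - len(kept)))
    return slots


scores = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
assignment = balanced_dispatch(scores, top_k=1, t=2)
```

Under these assumptions every expert processes the same number of slots regardless of the input, so the per-expert load no longer depends on which tokens were routed where. In this toy run, token 2 is dropped from the oversubscribed expert 0 and expert 1 receives one dummy slot.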
