SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference
Yuseon Choi, Sangjin Kim, Jungjun Oh, Gwangtae Park, Byeongcheol Kim, Hoi-Jun Yoo
TL;DR
Mixture-of-Experts (MoE) models offer large parameter capacity but are hard to deploy on-device because of their vast expert pools and costly expert offloading. SliceMoE combines Dynamic Bit-Sliced Caching (DBSC), Calibration-Free Asymmetric Matryoshka Quantization (AMAT), and Predictive Cache Warmup (PCW) to deliver miss-rate-constrained, energy-efficient MoE inference on on-device memory hierarchies. DBSC expands effective cache capacity, AMAT preserves near-high-bit accuracy, and PCW reduces early-decode cold misses through phase-aware warmup during prefill. On two MoE models, DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, this yields decode-stage energy reductions of up to 2.37x and 2.85x and decode-latency improvements of up to 1.81x and 1.64x, respectively. The approach targets practical, low-energy MoE deployment on constrained hardware without extensive model modification or calibration.
Abstract
MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37x and 2.85x, respectively, and improves decode latency by up to 1.81x and 1.64x, while preserving near-high-bit accuracy. These results demonstrate that slice-level caching enables efficient on-device MoE deployment.
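To make the idea of truncation-compatible slices concrete, the following is a minimal sketch of generic bit-slicing for asymmetric integer quantization. It is not the authors' AMAT or DBSC implementation: the 8-bit total width, the 4-bit split, and the names split_slices, merge_slices, and dequantize are all assumptions chosen for illustration. It shows that a most-significant-bit (MSB) slice alone yields a coarse weight, and that adding the least-significant-bit (LSB) slice restores the full-precision code without storing the weight twice.

```python
# Illustrative sketch only; bit widths and function names are assumptions,
# not the scheme described in the paper.

def split_slices(codes, total_bits=8, high_bits=4):
    """Split unsigned integer codes into an MSB slice and an LSB slice."""
    low_bits = total_bits - high_bits
    mask = (1 << low_bits) - 1
    msb = [c >> low_bits for c in codes]   # truncated, low-precision codes
    lsb = [c & mask for c in codes]        # residual bits
    return msb, lsb


def merge_slices(msb, lsb, total_bits=8, high_bits=4):
    """Recombine both slices into the original full-precision codes."""
    low_bits = total_bits - high_bits
    return [(m << low_bits) | l for m, l in zip(msb, lsb)]


def dequantize(codes, scale, zero_point):
    """Asymmetric dequantization: x is approximately scale * (q - zero_point)."""
    return [scale * (c - zero_point) for c in codes]


codes = [0, 37, 200, 255]          # hypothetical 8-bit weight codes
scale, zero_point = 0.5, 128       # hypothetical quantization parameters

msb, lsb = split_slices(codes)

# Low-precision path: only the MSB slice is resident, so the missing
# low bits are treated as zero.
coarse_codes = [m << 4 for m in msb]
coarse = dequantize(coarse_codes, scale, zero_point)

# High-precision path: both slices are resident.
full = dequantize(merge_slices(msb, lsb), scale, zero_point)
```

Under these assumptions, a cache could keep only the MSB slice for some experts and fetch the LSB slice when higher precision is requested; the truncation error per weight is bounded by scale * (2^low_bits - 1).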
