SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference

Yuseon Choi, Sangjin Kim, Jungjun Oh, Gwangtae Park, Byeongcheol Kim, Hoi-Jun Yoo

TL;DR

MoE models offer large parameter capacity but are hard to deploy on-device because of their large expert pools and the cost of expert offloading. SliceMoE combines Dynamic Bit-Sliced Caching (DBSC), Calibration-Free Asymmetric Matryoshka Quantization (AMAT), and Predictive Cache Warmup (PCW) to deliver miss-rate-constrained, energy-efficient MoE inference on on-device memory hierarchies. DBSC expands effective cache capacity, AMAT preserves near-high-bit accuracy, and PCW reduces early-decode cold misses by reshaping cache contents during prefill. On DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, decode-stage energy drops by up to 2.37x and 2.85x, and decode latency improves by up to 1.81x and 1.64x, respectively. This work enables practical, low-energy MoE deployments in constrained hardware environments without requiring extensive model modification or calibration.

Abstract

MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37x and 2.85x, respectively, and improves decode latency by up to 1.81x and 1.64x, while preserving near-high-bit accuracy. These results demonstrate that slice-level caching enables efficient on-device MoE deployment.
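To make the idea of truncation-compatible slices concrete, here is a minimal sketch of generic bit-slicing on unsigned 8-bit weight codes. It is not the authors' AMAT scheme: the 4+4 bit split and the names `LOW_BITS`, `split_slices`, and `reconstruct` are assumptions chosen purely for illustration.

```python
from typing import Optional, Tuple

# Hypothetical illustration only: the 4+4 split and all names are
# assumptions, not details taken from the paper.
LOW_BITS = 4  # width of the least-significant slice


def split_slices(code: int) -> Tuple[int, int]:
    """Split an unsigned 8-bit weight code into (msb_slice, lsb_slice)."""
    return code >> LOW_BITS, code & ((1 << LOW_BITS) - 1)


def reconstruct(msb: int, lsb: Optional[int] = None) -> int:
    """Rebuild an 8-bit code; if the LSB slice is absent, its bits are zero."""
    low_part = 0 if lsb is None else lsb
    return (msb << LOW_BITS) | low_part


codes = [0, 37, 128, 255]
slices = [split_slices(c) for c in codes]

# Low precision: only the MSB slice is available (a truncated code).
low = [reconstruct(m) for m, _ in slices]
# High precision: the same MSB slice plus the LSB slice, with no second copy.
high = [reconstruct(m, l) for m, l in slices]
```

In this toy setting the low-precision code is a plain truncation of the high-precision one, so raising an expert's precision only requires fetching the missing LSB slice rather than a separate high-bit copy of the weights.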

Paper Structure

This paper contains 15 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: (a) On-premises MoE deployment under a single-batch execution scenario. (b) Miss penalty from Flash access and the conceptual execution flow of miss-rate–constrained MoE inference.
  • Figure 2: Previous cache-aware routing approach (Cache-Prior) and its limitations within our Region of Interest (RoI) under miss-rate constraints for energy-efficient MoE inference.
  • Figure 3: Phase-wise statistics of expert selection frequencies during Prefill and early Decode.
  • Figure 4: Motivation and overall execution flow of the proposed Dynamic Bit-Sliced Caching (DBSC).
  • Figure 5: Conventional multi-bitwidth precision methods and the proposed calibration-free Asymmetric Matryoshka Quantization (AMAT).
  • ...and 5 more figures