MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
Arian Raje, Anupam Nayak, Gauri Joshi
TL;DR
MELINOE tackles memory-constrained MoE inference by fine-tuning routers to concentrate activations on a small, per-sequence set of experts and then prefetching those experts before decoding. The method combines a cache-aware fine-tuning objective with a lightweight activation predictor, enabling proactive GPU caching and reducing CPU–GPU transfers. Experimental results show substantial throughput gains across multiple backbones and hardware without compromising downstream accuracy or ROUGE-L scores, demonstrating practical deployment benefits for resource-limited MoE systems. MELINOE thus provides a deployment-friendly path to scalable MoE inference that complements existing offloading and quantization techniques.
Abstract
Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token, enabling computationally efficient training and inference. However, their large overall parameter counts and model sizes have precluded their widespread use in resource-constrained settings, because all of the parameters must still be loaded into GPU memory. Prior works aim to address this memory bottleneck by offloading certain experts into CPU memory and transferring them to GPU memory only when they are activated. In practice, these methods suffer from the significant I/O latency incurred by expert transfer. We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence. Caching these preferred experts in GPU memory reduces expert churn and CPU-GPU transfer overhead. MELINOE increases throughput by $1.2\times$ to $3\times$ over efficient baselines and by up to $14.7\times$ over transfer-heavy baselines while retaining or even improving the performance of the model on a downstream task, making it a reliable method for improving MoE inference efficiency.
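
To make the caching argument concrete, the following toy simulation counts expert transfers under a fixed-size GPU cache. It is a minimal sketch and not the MELINOE implementation: the LRU eviction policy, the single-expert-per-token routing, the synthetic traces, and all names (`count_transfers`, `sample_trace`, `p_preferred`) are assumptions made for illustration only.

```python
import random
from collections import OrderedDict


def count_transfers(expert_trace, cache_size, prefetch=()):
    """Count CPU-to-GPU expert loads during decoding under an LRU cache.

    Experts listed in `prefetch` are assumed to be loaded before decoding
    starts, so they are not counted as decode-time transfers.
    """
    cache = OrderedDict((e, None) for e in list(prefetch)[:cache_size])
    transfers = 0
    for expert in expert_trace:
        if expert in cache:
            cache.move_to_end(expert)      # cache hit: mark as most recently used
        else:
            transfers += 1                 # cache miss: expert must be fetched
            cache[expert] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used expert
    return transfers


def sample_trace(num_tokens, num_experts, preferred, p_preferred, rng):
    """Synthetic routing trace: with probability `p_preferred` a token is
    routed to one of the `preferred` experts, otherwise to any expert."""
    trace = []
    for _ in range(num_tokens):
        if rng.random() < p_preferred:
            trace.append(rng.choice(preferred))
        else:
            trace.append(rng.randrange(num_experts))
    return trace


rng = random.Random(0)
NUM_EXPERTS, CACHE_SIZE, NUM_TOKENS = 64, 8, 2000
preferred = list(range(CACHE_SIZE))  # hypothetical per-sequence preferred set

# Routing spread uniformly over all experts, no prefetching.
uniform = sample_trace(NUM_TOKENS, NUM_EXPERTS, preferred, 0.0, rng)
# Routing concentrated on the preferred set, which is prefetched.
concentrated = sample_trace(NUM_TOKENS, NUM_EXPERTS, preferred, 0.9, rng)

uniform_transfers = count_transfers(uniform, CACHE_SIZE)
concentrated_transfers = count_transfers(concentrated, CACHE_SIZE, prefetch=preferred)
print(f"uniform routing:      {uniform_transfers} transfers")
print(f"concentrated routing: {concentrated_transfers} transfers")
```

In this toy setting, the concentrated trace incurs far fewer decode-time transfers than the uniform one, which is the effect the abstract attributes to reduced expert churn. The simulation says nothing about actual throughput, which also depends on transfer bandwidth, expert size, and compute overlap.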
