MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models

Arian Raje; Anupam Nayak; Gauri Joshi

TL;DR

MELINOE tackles memory-constrained MoE inference by fine-tuning routers to concentrate activations on a small, per-sequence set of experts and then prefetching those experts before decoding. The method combines a cache-aware fine-tuning objective with a lightweight activation predictor, enabling proactive GPU caching and reducing CPU-GPU transfers. Experiments show substantial throughput gains across multiple backbones and hardware setups without degrading downstream accuracy or ROUGE-L scores. MELINOE complements existing offloading and quantization techniques for resource-limited MoE deployment.

Abstract

Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token, enabling computationally efficient training and inference. However, their large overall parameter counts and model sizes have precluded their widespread use in resource-constrained settings, as all of the parameters must still be loaded into GPU memory. Prior work aims to address this memory bottleneck by offloading certain experts into CPU memory and moving them to GPU memory only when they are activated. In practice, these methods suffer from the significant I/O latency incurred by expert transfer. We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence. Caching these preferred experts in GPU memory reduces expert churn and CPU-GPU transfer overhead. MELINOE increases throughput by $1.2\times$ to $3\times$ over efficient baselines and by up to $14.7\times$ over transfer-heavy baselines while retaining or even improving the performance of the model on a downstream task, making it a reliable method for improving MoE inference efficiency.
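
The effect of routing concentration on transfer volume can be illustrated with a toy cache simulation. The following is a minimal sketch, not the paper's implementation: it assumes an LRU eviction policy, top-2 routing over 8 experts, and made-up routing traces, and `count_transfers` is a hypothetical helper introduced only for illustration. Each cache miss stands in for one CPU-to-GPU expert transfer.

```python
from collections import OrderedDict


def count_transfers(trace, cache_size):
    """Count cache misses (CPU-to-GPU expert loads) for a routing trace.

    trace: list of per-token lists of activated expert ids.
    cache_size: number of experts that fit in GPU memory.
    Assumption: least-recently-used (LRU) eviction.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    transfers = 0
    for experts in trace:
        for expert in experts:
            if expert in cache:
                cache.move_to_end(expert)  # hit: refresh recency
            else:
                transfers += 1  # miss: expert must be copied from CPU memory
                cache[expert] = None
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return transfers


# Toy traces (made-up data): six tokens, two experts activated per token.
diffuse = [[0, 1], [2, 3], [4, 5], [6, 7], [0, 1], [2, 3]]
concentrated = [[0, 1], [0, 2], [1, 2], [0, 1], [0, 2], [1, 2]]

diffuse_transfers = count_transfers(diffuse, cache_size=4)
concentrated_transfers = count_transfers(concentrated, cache_size=4)
print(diffuse_transfers, concentrated_transfers)
```

In this toy setting the diffuse trace misses on every activation, while the concentrated trace only pays for the initial loads of its three preferred experts. Preloading those experts before decoding, as a prefetching step would, removes even those initial misses.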

Paper Structure (46 sections, 3 theorems, 36 equations, 12 figures, 13 tables)

Introduction
Problem Setup and Motivation
Expert Offloading Systems.
Expert Specialization.
Fine-Tuning to Achieve Cache-Friendliness.
Method
Pre-Deployment Stage
MoE Fine-Tuning Procedure
Cache Simulation Loss $\mathcal{L}_{cs}$.
Rank Matching Loss $\mathcal{L}_{rm}$.
Expert Activation Predictor
Post-Deployment Stage
Results
Experimental Setup
Models and Datasets.
...and 31 more sections

Key Result

Proposition C.3

Using the definitions in Equations eq:countdef and eq:cachedef, define $Z^{(t)} = \gamma^{t-1} + \frac{K}{C}\sum_{i=1}^{t-1}\gamma^{t-i-1}$. Then, by Equation eq:cachedef, one can recursively update the soft cache state as

Figures (12)

Figure 1: OLMoE transfer behavior and routing concentration before vs. after fine-tuning.
Figure 2: Overview of MELINOE. Pre-deployment: fine-tune the model for per-sequence routing locality and train an activation predictor. Post-deployment: predict likely experts, preload a GPU-resident cache, and run offloaded inference with fewer CPU-GPU transfers.
Figure 3: Throughput comparison of MELINOE against prior baselines across model/dataset/GPU configurations.
Figure 4: Impact of varying $\lambda_{cs}$ and $\lambda_{rm}$ on number of expert transfers and model performance (OLMoE, 64 output tokens).
Figure 6: Throughput of baselines with various output lengths using OLMoE on the H100 setup with 3GB of VRAM.
...and 7 more figures

Theorems & Definitions (12)

Definition C.1: $\gamma$ cache eviction
Remark C.2
Proposition C.3
proof
Lemma C.4
proof
Remark C.5
Remark C.6
Definition C.7
Lemma C.8
...and 2 more
