Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design

Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang Wang

TL;DR

This paper proposes a novel framework, Read-ME, that transforms pre-trained dense LLMs into smaller MoE models (in contrast to "upcycling" generalist MoEs), avoiding the high costs of ground-up training, and outperforms other popular open-source dense models of similar scales.

Abstract

The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, due to misaligned design choices between the model architecture and the system policies. Furthermore, the conventional approach of training MoEs from scratch is increasingly prohibitive in terms of cost. In this paper, we propose a novel framework, Read-ME, that transforms pre-trained dense LLMs into smaller MoE models (in contrast to "upcycling" generalist MoEs), avoiding the high costs of ground-up training. Our approach employs activation sparsity to extract experts. To compose experts, we examine the widely adopted layer-wise router design and show its redundancy, and thus we introduce a pre-gating router decoupled from the MoE backbone that facilitates system-friendly pre-computing and lookahead scheduling, enhancing expert-aware batching and caching. Our co-design therefore addresses critical gaps on both the algorithmic and system fronts, establishing a scalable and efficient alternative for LLM inference in resource-constrained settings. Read-ME outperforms other popular open-source dense models of similar scales, achieving improvements of up to 10.1% on MMLU, and improving mean end-to-end latency by up to 6.1%. Code is available at: https://github.com/VITA-Group/READ-ME.
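
To make the idea of pre-gating and expert-aware batching more concrete, here is a minimal sketch. It is not the authors' implementation: the `Token` representation, the `toy_router` function, and the "serve the longest queue first" policy are all assumptions chosen for illustration. The only point it demonstrates is that, once a router decoupled from the backbone has assigned an expert to every queued token ahead of time, the scheduler can form batches in which all tokens share one expert.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical token representation: (request_id, token_id).
Token = Tuple[int, int]


def pre_gate(tokens: List[Token],
             router: Callable[[Token], int]) -> Dict[int, List[Token]]:
    """Run the router once per queued token and group tokens by expert.

    The router is called before any expert computation, which is what
    lets the scheduler see expert assignments ahead of time.
    """
    queues: Dict[int, List[Token]] = defaultdict(list)
    for tok in tokens:
        queues[router(tok)].append(tok)
    return dict(queues)


def next_batch(queues: Dict[int, List[Token]],
               max_batch: int) -> Tuple[int, List[Token]]:
    """Pop up to max_batch tokens that all go to the same expert.

    Assumed policy: pick the expert with the most pending tokens.
    """
    if not queues:
        raise ValueError("no pending tokens")
    expert = max(queues, key=lambda e: len(queues[e]))
    batch = queues[expert][:max_batch]
    remaining = queues[expert][max_batch:]
    if remaining:
        queues[expert] = remaining
    else:
        del queues[expert]
    return expert, batch


def toy_router(tok: Token) -> int:
    """Stand-in for a learned router: picks one of two experts by parity."""
    return tok[1] % 2


pending = [(0, 1), (0, 2), (1, 3), (1, 5), (2, 4)]
queues = pre_gate(pending, toy_router)
first_expert, first_batch = next_batch(queues, max_batch=2)
```

In this toy run, three tokens are routed to expert 1 and two to expert 0, so the first batch contains two tokens for expert 1 and only that expert's weights need to be resident for the step. A real serving system would also have to respect per-request token order and latency targets, which this sketch ignores.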

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of Read-ME. This figure shows the refactoring of a pre-trained dense model (in yellow) into two experts (in red and green). After refactoring, the model is deployed, and the serving timeline is depicted. At time $t=0$, multiple inference requests (each a sequence of tokens) are queued, with expert assignment for each token undecided ("$?$") until processed by the router. Our router pre-gates tokens before inference, enabling expert-aware batching. Tokens are routed to their respective experts and batched accordingly: at $t=0$ for Expert 1 (red) and at $t=1$ for Expert 2 (green). New tokens enter the queue at each time step, with routing computed only for incoming tokens marked "$?$".
  • Figure 2: (a) Visualization of transition matrix between the ($l$-$1$)-th layer and the $l$-th layer, where each coordinate $[\{s,t\}, \{i,j\}]$ represents $P(\mathcal{S}^{(l)}=\{i, j\} | \mathcal{S}^{(l-1)}=\{s, t\})$. The row-wise sparse pattern suggests that the router decision becomes almost deterministic given the previous layer's decision. (b) Mutual information $I(\mathcal{S}^{(l)}; \mathcal{S}^{(l-1)})$, which indicates the learned knowledge shared by two neighboring layers is high. (c) Overview figure of router tuning and router distillation loss.
  • Figure 3: Challenges of MoE serving in current serving systems and Read-ME's batching pipeline.
  • Figure 4: Evaluation of Read-ME on the MMLU benchmark (Hendrycks et al., 2020), compared to other open-source models and compression techniques (performance numbers are collected from their respective papers).
  • Figure 4: Cache hit ratio measured in batched inference setup.
  • ...and 3 more figures
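
The mutual information in the Figure 2 caption can be computed directly from a transition matrix like the one in panel (a). The sketch below is an illustration under stated assumptions, not the authors' analysis code: it assumes the matrix is given as row-stochastic nested lists where `transition[s][t]` is $P(\mathcal{S}^{(l)}=t \mid \mathcal{S}^{(l-1)}=s)$, that a prior over the previous layer's expert selections is supplied separately, and that the result is reported in nats. The function name and the example matrices are hypothetical.

```python
import math
from typing import List


def mutual_information(prior: List[float],
                       transition: List[List[float]]) -> float:
    """I(S_l; S_{l-1}) in nats.

    prior[s]         = P(S_{l-1} = s)
    transition[s][t] = P(S_l = t | S_{l-1} = s), each row summing to 1
    """
    n_next = len(transition[0])
    # Marginal over the current layer's selection: P(S_l = t).
    marginal = [
        sum(prior[s] * transition[s][t] for s in range(len(prior)))
        for t in range(n_next)
    ]
    mi = 0.0
    for s, p_s in enumerate(prior):
        for t, p_t_given_s in enumerate(transition[s]):
            joint = p_s * p_t_given_s
            if joint > 0.0:  # zero-probability terms contribute nothing
                mi += joint * math.log(p_t_given_s / marginal[t])
    return mi


uniform_prior = [0.5, 0.5]
# Each previous selection fully determines the next one (sparse rows).
deterministic = [[1.0, 0.0], [0.0, 1.0]]
# The next selection is independent of the previous one.
independent = [[0.5, 0.5], [0.5, 0.5]]

mi_deterministic = mutual_information(uniform_prior, deterministic)
mi_independent = mutual_information(uniform_prior, independent)
```

With a uniform prior over two states, the deterministic matrix reaches the maximum value of $\ln 2$ nats, while the independent matrix gives zero. A row-wise sparse transition matrix sits close to the deterministic case, which is the sense in which high mutual information suggests that consecutive layer-wise routers are largely redundant.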