MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models

Zehua Liu, Han Wu, Ruifeng She, Xiaojin Fu, Xiongwei Han, Tao Zhong, Mingxuan Yuan

TL;DR

MoLAE introduces a latent-space factorization for Mixture of Experts: each expert's full FFN weight matrices are replaced by shared latent projections combined with expert-specific transforms, which reduces parameters and compute while largely preserving performance. The authors provide a mathematical framework for transforming pre-trained MoE models into MoLAE using rank reduction and SVD-based matrix factorization, together with an analysis of the conditions under which the factorization is optimal. Empirically, MoLAE matches or closely approaches standard MoE performance on downstream tasks and during GPT-2–scale pretraining, while using fewer parameters (e.g., ~40% fewer non-embedding parameters) and reducing communication overhead. The work suggests that a latent-space approach can maintain model capability at scale, enabling more economical deployment of large language models, and outlines a path for extending latent-space adaptations to other transformer components.

Abstract

Mixture of Experts (MoE) has become a key architectural paradigm for efficiently scaling Large Language Models (LLMs) by selectively activating a subset of parameters for each input token. However, standard MoE architectures face significant challenges, including high memory consumption and communication overhead during distributed training. In this paper, we introduce Mixture of Latent Experts (MoLAE), a novel parameterization that addresses these limitations by reformulating expert operations through a shared projection into a lower-dimensional latent space, followed by expert-specific transformations. This factorized approach substantially reduces parameter count and computational requirements, particularly in existing LLMs where hidden dimensions significantly exceed MoE intermediate dimensions. We provide a rigorous mathematical framework for transforming pre-trained MoE models into MoLAE architecture, characterizing conditions for optimal factorization, and developing a systematic two-step algorithm for this conversion. Our comprehensive theoretical analysis demonstrates that MoLAE significantly improves efficiency across multiple dimensions while preserving model capabilities. Experimental results confirm that MoLAE achieves comparable performance to standard MoE with substantially reduced resource requirements.
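
To make the parameter argument concrete, the following is a rough sketch of the arithmetic for a single projection, using the shapes stated in Theorem 1 ($W^i \in \mathbb{R}^{m \times n}$, $A^i \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times n}$). The function names and the example sizes are hypothetical, and the sketch assumes one $B$ shared by all $N$ experts, which may differ from the grouping used in the paper.

```python
def moe_params(num_experts: int, m: int, n: int) -> int:
    """Parameters of N independent expert matrices W^i, each of shape (m, n)."""
    return num_experts * m * n


def molae_params(num_experts: int, m: int, n: int) -> int:
    """Parameters of one shared B (m, n) plus N expert matrices A^i (m, m).

    Assumption: a single B is shared across all experts.
    """
    return m * n + num_experts * m * m


if __name__ == "__main__":
    # Hypothetical sizes, chosen only so that n is much larger than m.
    N, m, n = 8, 512, 4096
    dense = moe_params(N, m, n)
    latent = molae_params(N, m, n)
    print(f"MoE: {dense}, MoLAE: {latent}, ratio: {latent / dense:.3f}")
```

The saving comes from the term $N m^2$ replacing $N m n$, so it grows as $n$ becomes large relative to $m$, which is the regime the abstract describes.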

Paper Structure

This paper contains 24 sections, 2 theorems, 20 equations, 3 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Given matrices $W^i \in \mathbb{R}^{m \times n}$ with $m \leq n$, there exist matrices $A^i \in \mathbb{R}^{m \times m}$ and a common matrix $B \in \mathbb{R}^{m \times n}$ such that $A^i B = W^i$ for all $i \in \{1,2,\ldots,N\}$, if and only if there exists an $(n - m)$-dimensional subspace $K \sub
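
The statement above is cut off in this extract, so the sketch below does not reproduce the paper's condition or its conversion algorithm. It only illustrates, under stated assumptions, one generic way to obtain a shared $B$ and per-expert $A^i$ with the shapes from the theorem: stack the $W^i$ vertically, take the top $m$ right singular vectors as $B$, and set $A^i = W^i B^\top$. By the Eckart-Young theorem this minimizes $\sum_i \|W^i - A^i B\|_F^2$, and it is exact when all $W^i$ share an $m$-dimensional row space. The function name `factorize_shared` is hypothetical.

```python
import numpy as np


def factorize_shared(weights):
    """Approximate each W^i (m, n) as A^i @ B with a shared B (m, n).

    Generic truncated-SVD construction; an illustration, not the paper's method.
    """
    m, n = weights[0].shape
    assert m <= n, "the theorem assumes m <= n"
    stacked = np.concatenate(weights, axis=0)  # shape (N * m, n)
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    B = vt[:m]  # shape (m, n), orthonormal rows
    # Least-squares optimal A^i for a B with orthonormal rows.
    A_list = [W @ B.T for W in weights]  # each of shape (m, m)
    return A_list, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic experts that share a row space, so the factorization is exact.
    B_true = rng.standard_normal((3, 10))
    experts = [rng.standard_normal((3, 3)) @ B_true for _ in range(4)]
    A_list, B = factorize_shared(experts)
    err = max(np.abs(A @ B - W).max() for A, W in zip(A_list, experts))
    print("max reconstruction error:", err)
```

For real pre-trained experts the shared row space assumption generally does not hold exactly, so the reconstruction would be an approximation whose quality has to be checked empirically.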

Figures (3)

  • Figure 1: Architectural comparison between MoE and MoLAE in the FFN layer. In both diagrams, $N_r$ denotes the number of routed experts. MoLAE extends the conventional MoE architecture by introducing latent mappings $B_{\text{up}}$, $B_{\text{gate}}$, and $B_{\text{down}}$ that capture shared information across experts. Expert-specific information is encapsulated in the mappings $A_{\text{up}}^i$, $A_{\text{down}}^i$, and $A_{\text{gate}}^i$ for each expert $i$.
  • Figure 2: Ablation study of group size $k$ on the Qwen1.5-MoE model.
  • Figure 3: Comparison of training loss curves between MoE and MoLAE models on the English Wikipedia dataset.

Theorems & Definitions (4)

  • Theorem 1
  • Proof 1
  • Theorem 2
  • Proof 2