MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models

Zehua Liu; Han Wu; Ruifeng She; Xiaojin Fu; Xiongwei Han; Tao Zhong; Mingxuan Yuan

TL;DR

MoLAE introduces a latent-space factorization for Mixture of Experts (MoE): each expert's full FFN weights are replaced with shared latent projections and expert-specific transforms, reducing parameters and compute while preserving performance. The authors provide a mathematical framework for converting pre-trained MoE models into MoLAE using rank reduction and SVD-based matrix factorization, together with an analysis of the conditions for optimal factorization. Empirically, MoLAE matches or closely approaches standard MoE performance on downstream tasks and during GPT-2-scale pretraining, while using fewer parameters (e.g., ~40% fewer non-embedding parameters) and incurring less communication overhead. The work demonstrates that a carefully designed latent-space approach can maintain model capability at scale, enabling more economical deployment of large language models, and outlines a path for extending latent-space adaptations to other transformer components.
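The SVD-based conversion mentioned above can be illustrated with a minimal sketch. It is not the paper's two-step algorithm, only one standard way to obtain a shared factor: stack the expert matrices, take a truncated SVD, and use the top right singular vectors as the shared projection. The names `factorize_shared`, `experts`, `A`, and `B`, and all dimensions, are hypothetical choices for this example.

```python
import numpy as np

def factorize_shared(expert_weights, rank):
    """Approximate each W_i (d_ff x d_model) as A_i @ B with one shared B.

    Assumption for illustration: B is taken from a truncated SVD of the
    vertically stacked expert matrices; the paper's procedure may differ.
    """
    stacked = np.concatenate(expert_weights, axis=0)   # (N * d_ff, d_model)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    B = vt[:rank]                                      # (rank, d_model), orthonormal rows
    A = [W @ B.T for W in expert_weights]              # each (d_ff, rank)
    return A, B

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, rank = 32, 8, 4, 16

# Toy experts built to share a rank-16 row space, so the factorization is exact.
basis = rng.standard_normal((rank, d_model))
experts = [rng.standard_normal((d_ff, rank)) @ basis for _ in range(n_experts)]

A, B = factorize_shared(experts, rank)
max_err = max(np.abs(W - A_i @ B).max() for W, A_i in zip(experts, A))
print("max reconstruction error:", max_err)
```

On real pre-trained experts the shared row space is only approximate, so the reconstruction error depends on the chosen rank; by the Eckart-Young theorem, the truncated SVD of the stacked matrix minimizes the summed squared Frobenius error among all factorizations with a single shared rank-`rank` factor.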

Abstract

Mixture of Experts (MoE) has become a key architectural paradigm for efficiently scaling Large Language Models (LLMs) by selectively activating a subset of parameters for each input token. However, standard MoE architectures face significant challenges, including high memory consumption and communication overhead during distributed training. In this paper, we introduce Mixture of Latent Experts (MoLAE), a novel parameterization that addresses these limitations by reformulating expert operations through a shared projection into a lower-dimensional latent space, followed by expert-specific transformations. This factorized approach substantially reduces parameter count and computational requirements, particularly in existing LLMs where hidden dimensions significantly exceed MoE intermediate dimensions. We provide a rigorous mathematical framework for transforming pre-trained MoE models into the MoLAE architecture, characterizing conditions for optimal factorization and developing a systematic two-step algorithm for this conversion. Our comprehensive theoretical analysis demonstrates that MoLAE significantly improves efficiency across multiple dimensions while preserving model capabilities. Experimental results confirm that MoLAE achieves comparable performance to standard MoE with substantially reduced resource requirements.
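To make the "shared projection, then expert-specific transformation" idea concrete, here is a minimal sketch of a single up-projection under that parameterization. It is an illustration under stated assumptions, not the authors' implementation: the names `B_shared`, `A_experts`, and `latent_expert_up` and all dimensions are hypothetical, routing is reduced to a fixed list of expert indices, and activations and down-projections are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden dimension well above the expert intermediate dimension.
d_model, d_ff, d_latent, n_experts = 64, 16, 16, 8

B_shared = rng.standard_normal((d_latent, d_model))          # shared by all experts
A_experts = rng.standard_normal((n_experts, d_ff, d_latent)) # one small matrix per expert

def latent_expert_up(x, expert_ids):
    """Project x into the latent space once, then apply each selected expert."""
    z = B_shared @ x                                 # (d_latent,), computed once per token
    return [A_experts[i] @ z for i in expert_ids]    # each (d_ff,)

x = rng.standard_normal(d_model)
outs = latent_expert_up(x, [1, 5])

# Each latent expert is equivalent to a dense expert with weight A_i @ B_shared.
W_equiv = A_experts[1] @ B_shared
same = np.allclose(outs[0], W_equiv @ x)

# Parameter counts for this one projection.
standard_params = n_experts * d_ff * d_model                     # 8192
latent_params = d_latent * d_model + n_experts * d_ff * d_latent # 3072
print(same, standard_params, latent_params)
```

In this toy setting the shared projection is paid for once, so the saving grows with the number of experts and with the gap between `d_model` and `d_latent`. These numbers describe only the sketch and say nothing about the savings reported in the paper.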
