LaDiMo: Layer-wise Distillation Inspired MoEfier

Sungyoon Kim, Youngjun Kim, Kihyo Moon, Minsung Jang

TL;DR

LaDiMo addresses the cost of scaling dense LLMs by converting a non-MoE Transformer into an MoE through a two-stage process: layer-wise expert construction and routing policy decision. It leverages layer-wise Knowledge Distillation to train MoE blocks to mimic FFN behavior, augmented with an adaptive, training-free router that determines per-layer routing strategies to balance accuracy and latency. The approach is demonstrated on LLaMA2-7B, achieving substantial parameter reduction and improved throughput with limited data, while preserving key task performance such as MMLU. This work provides a practical pathway for deploying inference-efficient MoE models in resource-constrained environments.

Abstract

The advent of large language models has revolutionized natural language processing, but their increasing complexity has led to substantial training costs, resource demands, and environmental impacts. In response, sparse Mixture-of-Experts (MoE) models have emerged as a promising alternative to dense models. Since training MoE models from scratch can be prohibitively expensive, recent studies have explored leveraging knowledge from pre-trained non-MoE models. However, existing approaches have limitations, such as requiring significant hardware resources and data. We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost. LaDiMo consists of two stages: layer-wise expert construction and routing policy decision. By harnessing the concept of Knowledge Distillation, we compress the model and rapidly recover its performance. Furthermore, we develop an adaptive router that optimizes inference efficiency by profiling the distribution of routing weights and determining a layer-wise policy that balances accuracy and latency. We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens, reducing activated parameters by over 20% while keeping accuracy. Our approach offers a flexible and efficient solution for building and deploying MoE models.
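
The layer-wise distillation idea can be illustrated with a toy example: for one layer, the MoE block (student) is trained so that its output matches the output of the original FFN (teacher) on the same input, measured by a mean-squared error. The sketch below is not the authors' implementation. The softmax router, the top-k selection with renormalized weights, and all names (`top_k_route`, `moe_output`, `mse`) are assumptions made for illustration, and the adaptive routing policy is not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k largest routing weights.
    Assumption: the kept weights are renormalized to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_output(expert_outputs, routes):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    return [sum(w * expert_outputs[i][d] for i, w in routes) for d in range(dim)]

def mse(pred, target):
    """Mean-squared error between student and teacher outputs."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Toy data (hypothetical): four experts' outputs for one token, the router
# logits for that token, and the dense FFN output acting as the teacher.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 1.0]]
router_logits = [2.0, 1.0, 0.0, -1.0]
teacher_output = [0.5, 0.5]

routes = top_k_route(router_logits, k=2)
student_output = moe_output(expert_outputs, routes)
loss = mse(student_output, teacher_output)
print(routes, student_output, loss)
```

In a real setup the loss would be backpropagated through the experts and the router of that single layer, which is what lets each layer be trained independently on cached inputs.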

Paper Structure (19 sections, 7 equations, 10 figures, 1 table)

This paper contains 19 sections, 7 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: The main framework of the Layer-wise Distillation Inspired MoEfier. The MoE block has its own gating router and experts, which are FFNs whose weights are initialized by splitting the reference FFN's weight matrices. The inputs $x$ are collected while running inference on a small text dataset, and these gathered inputs serve as the training set for the MoE block. An auxiliary loss and an adaptive router are also applied; both are explained in their respective sections of the paper.
  • Figure 2: Continued pre-training gives a smaller loss than starting from random initial weights. The experiment was conducted on a single NVIDIA A100 GPU with the Chatbot Instruction Prompts dataset and the LLaMA2-7B model.
  • Figure 3: Training loss $\mathcal{L}_{\text{mse}}$ for each layer's MoEfier.
  • Figure 4: Effect of the number of MoEfied layers on throughput and accuracy for the LLaMA2-7B model.
  • Figure 5: Accuracy as a function of which single layer is MoEfied. The $x$-axis is the layer index from 0 to 31; the accuracies of the vanilla LLaMA2-7B model are plotted at $x=32$ for reference.
  • ...and 5 more figures
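
The Figure 1 caption says the experts are initialized by splitting the reference FFN's weight matrices. A minimal sketch of one way to do that follows. It assumes a gated (SwiGLU-style) FFN as used in LLaMA-family models and an equal, contiguous split along the hidden dimension; the paper's exact splitting scheme is not given in this section, and the function names are hypothetical. Under this assumption, summing the outputs of all experts reproduces the dense FFN output exactly, which gives distillation a sensible starting point.

```python
import math
import random

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(x, W_gate, W_up, W_down):
    """Gated FFN: W_down @ (silu(W_gate @ x) * (W_up @ x)).
    W_gate, W_up have shape [hidden][d_model]; W_down has [d_model][hidden]."""
    h = [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
    return matvec(W_down, h)

def split_ffn(W_gate, W_up, W_down, num_experts):
    """Assumed scheme: cut the hidden dimension into equal contiguous chunks,
    one chunk of rows (gate, up) and columns (down) per expert."""
    hidden = len(W_gate)
    assert hidden % num_experts == 0, "hidden size must divide evenly"
    size = hidden // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        experts.append((W_gate[lo:hi], W_up[lo:hi],
                        [row[lo:hi] for row in W_down]))
    return experts

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Tiny random FFN standing in for a real layer (sizes are illustrative only).
rng = random.Random(0)
d_model, hidden, num_experts = 4, 8, 4
W_gate = rand_matrix(hidden, d_model, rng)
W_up = rand_matrix(hidden, d_model, rng)
W_down = rand_matrix(d_model, hidden, rng)
x = [rng.uniform(-1, 1) for _ in range(d_model)]

dense_out = ffn(x, W_gate, W_up, W_down)
experts = split_ffn(W_gate, W_up, W_down, num_experts)
expert_outs = [ffn(x, *expert) for expert in experts]
summed = [sum(vals) for vals in zip(*expert_outs)]
max_gap = max(abs(a - b) for a, b in zip(dense_out, summed))
print(max_gap)  # tiny floating-point difference only
```

Once only a subset of experts is activated per token, the sum no longer matches the dense output, and that gap is what the layer-wise training has to close.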