Mixture of Modular Experts: Distilling Knowledge from a Multilingual Teacher into Specialized Modular Language Models

Mohammed Al-Maamari; Mehdi Ben Amor; Michael Granitzer

TL;DR

The paper tackles efficient, modular multilingual language modeling by combining Knowledge Distillation (KD) with Mixture of Experts (MoE) to obtain language- and task-specific specialization while limiting catastrophic forgetting. It distills a GPT-2 Medium teacher into smaller student models and uses a router that classifies each input by language and sends it to the matching language-specific expert. Training minimizes a combined loss $L_{total} = \alpha L_{LM} + \beta L_{KD}$, where $L_{LM}$ is the language-modeling loss and $L_{KD}$ is the distillation loss; adaptive variants of the weighting are also explored. The study compares three MoE configurations (PLE, JEET, MoE-CE), reports 99.95% precision, recall, and F1 score for the router, and shows that introducing a common expert can boost cross-language performance. The MoE setup also mitigates forgetting compared to sequential learning. Open-source datasets, a dataset-creation tool, and code are provided to enable replication and extension.
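The weighted loss $L_{total} = \alpha L_{LM} + \beta L_{KD}$ can be illustrated for a single token position with a minimal pure-Python sketch. Several details here are assumptions not stated in this summary: $L_{LM}$ is taken to be cross-entropy against the ground-truth token, $L_{KD}$ is taken to be the KL divergence between temperature-softened teacher and student distributions, and the function names, the `temperature` parameter, and the default weights are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def lm_loss(student_logits, target_index):
    """Assumed L_LM: cross-entropy of the student against the true token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed L_KD: KL(teacher || student) on temperature-softened outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )

def total_loss(student_logits, teacher_logits, target_index,
               alpha=0.5, beta=0.5, temperature=2.0):
    """L_total = alpha * L_LM + beta * L_KD (weights are illustrative)."""
    return (alpha * lm_loss(student_logits, target_index)
            + beta * kd_loss(student_logits, teacher_logits, temperature))
```

With fixed weights, `alpha` and `beta` stay constant throughout training; an adaptive variant would recompute them during training, by a rule this summary does not specify.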

Abstract

This research combines Knowledge Distillation (KD) and Mixture of Experts (MoE) to develop modular, efficient multilingual language models. Key objectives include evaluating adaptive versus fixed alpha methods in KD and comparing modular MoE architectures for handling multi-domain inputs and preventing catastrophic forgetting. KD compresses large language models (LLMs) into smaller, efficient models, while MoE enhances modularity with specialized tasks. Experiments showed similar performance for both KD methods, with marginal improvements from adaptive alpha. A combined loss approach provided more stable learning. The router, trained to classify input sequences into English, French, German, or Python, achieved 99.95% precision, recall, and F1 score, with Logistic Regression being the most effective classifier. Evaluations of modular MoE architectures revealed that Pre-trained Language Experts (PLE) and Joint Expert Embedding Training (JEET) performed similarly, while the MoE with Common Expert (MoE-CE) setup showed slightly lower performance. Including a common expert in MoE-CE improved its performance. Studies on catastrophic forgetting indicated that sequential training led to significant forgetting, while single-session training with balanced batches and the MoE approach mitigated this issue. The MoE architecture preserved knowledge across multiple languages effectively. The research contributes open-sourced resources including the dataset (https://zenodo.org/doi/10.5281/zenodo.12677631), a balanced dataset creation tool (https://github.com/padas-lab-de/multi-language-dataset-creator), and the research codebase (https://github.com/ModMaamari/mixture-modular-experts).
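The routing step described in the abstract, where a classifier labels each input sequence and the matching expert handles it, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `toy_router` is a hypothetical keyword heuristic standing in for the trained Logistic Regression classifier, and the experts are placeholder functions rather than distilled language models.

```python
from typing import Callable, Dict

LABELS = ("english", "french", "german", "python")

def toy_router(text: str) -> str:
    """Hypothetical stand-in for the trained router: returns one of LABELS."""
    lowered = text.lower()
    if "def " in lowered or "import " in lowered:
        return "python"
    words = lowered.split()
    if any(w in words for w in ("der", "und", "nicht")):
        return "german"
    if any(w in words for w in ("le", "est", "une")):
        return "french"
    return "english"

def make_expert(name: str) -> Callable[[str], str]:
    """Placeholder expert; a real one would be a student language model."""
    return lambda text: f"[{name} expert] {text}"

EXPERTS: Dict[str, Callable[[str], str]] = {
    label: make_expert(label) for label in LABELS
}

def route(text: str,
          router: Callable[[str], str] = toy_router,
          experts: Dict[str, Callable[[str], str]] = EXPERTS) -> str:
    """Classify the whole sequence once, then dispatch to a single expert."""
    label = router(text)
    return experts[label](text)
```

Because the router assigns one label per input sequence, only the selected expert runs for that input, which is what keeps the experts modular.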

Paper Structure (41 sections, 1 equation, 10 figures, 6 tables)

This paper contains 41 sections, 1 equation, 10 figures, 6 tables.

Introduction
Related Work
Knowledge Distillation
Mixture of Experts
Methods
Dataset Preparation
Tokenization
Teacher Model Training
Knowledge Distillation
Mixture of Experts Architecture
Router
Training and Inference
Results
Experimental Setup
Adaptive vs. Fixed Alpha
...and 26 more sections

Figures (10)

Figure 1: Dataset Splits for Each Language
Figure 2: Learning Curve for Mistral 1.6B, Mistral GPT-M 440M, Phi 1.34B
Figure 3: Our Knowledge Distillation Process
Figure 4: Architecture of the Joint Expert Embedding Training MoE Setup
Figure 5: Architecture of the MoE with Common Expert Setup
...and 5 more figures
