Mixture of Modular Experts: Distilling Knowledge from a Multilingual Teacher into Specialized Modular Language Models

Mohammed Al-Maamari, Mehdi Ben Amor, Michael Granitzer

TL;DR

The paper tackles efficient, modular multilingual language modeling by combining Knowledge Distillation (KD) with Mixture of Experts (MoE) to achieve language- and task-specific specialization while limiting catastrophic forgetting. It distills a GPT-2 Medium teacher into smaller student models and uses a language-classification router to send each input to a language-specific expert. Training minimizes a total loss $L_{total} = \alpha L_{LM} + \beta L_{KD}$, where $L_{LM}$ is the language-modeling loss and $L_{KD}$ the distillation loss; adaptive variants of the weighting are also explored. The study compares three MoE configurations (PLE, JEET, MoE-CE), demonstrates the router's high classification accuracy, and shows that introducing a common expert can boost cross-language performance. The MoE setup also mitigates forgetting compared to sequential training. Open-source datasets, a dataset-creation tool, and code are provided to enable replication and extension.
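As a rough illustration of the total loss $L_{total} = \alpha L_{LM} + \beta L_{KD}$ stated above, the sketch below computes it for a single next-token prediction. The summary gives only the weighted sum; the specific forms used here are assumptions, not taken from the paper: $L_{LM}$ as cross-entropy against the ground-truth token, $L_{KD}$ as a temperature-softened KL divergence between teacher and student (the common Hinton-style formulation), and the function names, default weights, and temperature are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def lm_loss(student_logits, target_index):
    """Assumed L_LM: cross-entropy of the student against the true next token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed L_KD: KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor is the usual gradient-scale correction; also an assumption here.
    return (temperature ** 2) * kl

def total_loss(student_logits, teacher_logits, target_index,
               alpha=0.5, beta=0.5, temperature=2.0):
    """L_total = alpha * L_LM + beta * L_KD, with fixed (non-adaptive) weights."""
    return (alpha * lm_loss(student_logits, target_index)
            + beta * kd_loss(student_logits, teacher_logits, temperature))

if __name__ == "__main__":
    student = [2.0, 0.5, -1.0, 0.1]   # toy logits over a 4-token vocabulary
    teacher = [2.5, 0.2, -1.5, 0.0]
    print(total_loss(student, teacher, target_index=0))
```

With fixed weights, `alpha` and `beta` stay constant during training; an adaptive variant would update them as training progresses, but the summary does not specify the schedule, so none is sketched here.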

Abstract

This research combines Knowledge Distillation (KD) and Mixture of Experts (MoE) to develop modular, efficient multilingual language models. Key objectives include evaluating adaptive versus fixed alpha methods in KD and comparing modular MoE architectures for handling multi-domain inputs and preventing catastrophic forgetting. KD compresses large language models (LLMs) into smaller, efficient models, while MoE enhances modularity with specialized tasks. Experiments showed similar performance for both KD methods, with marginal improvements from adaptive alpha. A combined loss approach provided more stable learning. The router, trained to classify input sequences into English, French, German, or Python, achieved 99.95% precision, recall, and F1 score, with Logistic Regression being the most effective classifier. Evaluations of modular MoE architectures revealed that Pre-trained Language Experts (PLE) and Joint Expert Embedding Training (JEET) performed similarly, while the MoE with Common Expert (MoE-CE) setup showed slightly lower performance. Including a common expert in MoE-CE improved its performance. Studies on catastrophic forgetting indicated that sequential training led to significant forgetting, while single-session training with balanced batches and the MoE approach mitigated this issue. The MoE architecture preserved knowledge across multiple languages effectively. The research contributes open-sourced resources including the dataset (https://zenodo.org/doi/10.5281/zenodo.12677631), a balanced dataset creation tool (https://github.com/padas-lab-de/multi-language-dataset-creator), and the research codebase (https://github.com/ModMaamari/mixture-modular-experts).
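The routing step described in the abstract (classify an input as English, French, German, or Python, then hand it to the matching expert) can be sketched as a top-1 dispatch. Everything below is illustrative: the paper's router is a trained classifier (Logistic Regression performed best), whereas `classify` here is a hypothetical cue-word heuristic standing in for it, and the experts are placeholder functions rather than distilled language models.

```python
LABELS = ("english", "french", "german", "python")

# Hypothetical cue words; a stand-in for the trained classifier in the paper.
CUES = {
    "english": {"the", "and", "is", "of"},
    "french": {"le", "la", "et", "est", "les"},
    "german": {"der", "die", "und", "ist", "das"},
    "python": {"def", "import", "return", "self"},
}

def classify(text):
    """Return the label whose cue words occur most often in the text."""
    tokens = text.lower().replace("(", " ").replace(":", " ").split()
    scores = {label: sum(tok in cues for tok in tokens)
              for label, cues in CUES.items()}
    # Ties fall back to the first label in LABELS order.
    return max(LABELS, key=lambda label: scores[label])

def make_expert(name):
    """Placeholder expert; a real one would be a language-specific student model."""
    return lambda text: f"[{name} expert] {text}"

EXPERTS = {label: make_expert(label) for label in LABELS}

def route(text):
    """Top-1 routing: exactly one expert handles each input sequence."""
    label = classify(text)
    return label, EXPERTS[label](text)

if __name__ == "__main__":
    for sample in ("def add(a, b): return a + b",
                   "der Hund und die Katze",
                   "le chat est sur la table"):
        print(route(sample))
```

Because each input reaches only one expert, the experts can be trained or replaced independently, which is the modularity the abstract refers to.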

Paper Structure (41 sections, 1 equation, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Dataset Splits for Each Language
  • Figure 2: Learning Curve for Mistral 1.6B, Mistral GPT-M 440M, Phi 1.34B
  • Figure 3: Our Knowledge Distillation Process
  • Figure 4: Architecture of the Joint Expert Embedding Training MoE Setup
  • Figure 5: Architecture of the MoE with Common Expert Setup
  • ...and 5 more figures