MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Tianhao Li, Shangjie Li, Binbin Xie, Deyi Xiong, Baosong Yang

TL;DR

This paper tackles the problem of extending large language models to low-resource languages without degrading performance on high-resource languages, a common pitfall of conventional continual training. It introduces MoE-CT, a paradigm that freezes the base LLM and augments it with a trainable Mixture-of-Experts (MoE) module alongside a fusion mechanism to integrate new multilingual knowledge with retained base knowledge. The approach demonstrates improved multilingual benchmarks and stronger resistance to forgetting compared with standard CT and LoRA-CT, while achieving data-efficient expansion (requiring less additional Chinese/English data). The method shows applicability across model sizes (e.g., Qwen-1b8 and Qwen-7b), offering a scalable, resource-conscious direction for inclusive language technologies and future continual-learning research in LLMs.

Abstract

The advent of large language models (LLMs) has predominantly catered to high-resource languages, leaving a disparity in performance for low-resource languages. Conventional Continual Training (CT) approaches to bridge this gap often undermine a model's original linguistic proficiency when expanding to multilingual contexts. Addressing this issue, we introduce a novel MoE-CT architecture, a paradigm that innovatively separates the base model's learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency. Our approach significantly outperforms conventional CT methods, as evidenced by our experiments, which show marked improvements in multilingual benchmarks without sacrificing the model's original language performance. Moreover, our MoE-CT framework demonstrates enhanced resistance to forgetting and superior transfer learning capabilities. By preserving the base model's integrity and focusing on strategic parameter expansion, our methodology advances multilingual language modeling and represents a significant step forward for low-resource language inclusion in LLMs, indicating a fruitful direction for future research in language technologies.

Paper Structure (20 sections, 4 equations, 2 figures, 10 tables)

This paper contains 20 sections, 4 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: The abscissa represents the number of tokens from the original data incorporated during the Continual Training (CT) process. Although an increased volume of original data may decelerate the model's forgetting, it can significantly impede the enhancement of multilingual capabilities.
  • Figure 2: The diagram on the right represents the training structure of MoE-CT, where the blue area indicates that parameters are frozen, and the yellow area indicates that parameters are trainable. The parameters for all experts and the shared feed-forward network (shared-ffn) are initialized from the feed-forward network (ffn) of the original model.
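
The setup described in the Figure 2 caption (a frozen base model, with trainable experts and a shared-ffn that start as copies of the original ffn) can be illustrated with a minimal sketch. This is not the authors' implementation: the `FFN` class, the `build_moe_ct_layer` helper, the nested-list weights, and the choice of two experts are all hypothetical and exist only for illustration. The routing and fusion mechanisms are not specified in this summary, so they are deliberately left out.

```python
import copy


class FFN:
    """Toy feed-forward layer y = W2 * relu(W1 * x), with nested lists as weights.

    Hypothetical stand-in for a transformer ffn; not the paper's code.
    """

    def __init__(self, w1, w2, trainable=True):
        self.w1 = w1
        self.w2 = w2
        self.trainable = trainable

    def __call__(self, x):
        # Hidden activations: relu of each row of W1 dotted with x.
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        # Output: each row of W2 dotted with the hidden activations.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


def build_moe_ct_layer(base_ffn, num_experts=2):
    """Freeze the base ffn and create trainable copies for experts and shared-ffn.

    Assumption: "initialized from the ffn of the original model" is modeled
    here as a plain deep copy of the base weights.
    """
    base_ffn.trainable = False  # blue area in Figure 2: frozen parameters
    experts = [copy.deepcopy(base_ffn) for _ in range(num_experts)]
    shared_ffn = copy.deepcopy(base_ffn)
    for module in experts + [shared_ffn]:
        module.trainable = True  # yellow area in Figure 2: trainable parameters
    return {"base": base_ffn, "experts": experts, "shared_ffn": shared_ffn}


# Example weights are arbitrary illustration values.
base = FFN(w1=[[1.0, -1.0], [0.5, 2.0]], w2=[[1.0, 0.0], [0.0, 1.0]])
layer = build_moe_ct_layer(base, num_experts=2)
x = [1.0, 2.0]

# At initialization every expert reproduces the frozen base ffn exactly.
print(all(expert(x) == layer["base"](x) for expert in layer["experts"]))
```

Because the experts are deep copies, later updates to an expert leave the frozen base weights untouched, which is the property the freezing scheme relies on to protect the original model's behavior.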