Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach
Salvador Carrión, Francisco Casacuberta
TL;DR
This work tackles continual learning in neural machine translation by adopting Low-Rank Adaptation (LoRA) to enable parameter-efficient fine-tuning across languages and domains, and by introducing a gradient-based regularization scheme targeted at low-rank decomposition matrices to mitigate catastrophic forgetting. It further proposes an interactive, gate-free Mixture of LoRA Experts (MoLE) approach for real-time domain adaptation without retraining. Empirical results show LoRA achieves substantial performance with a small parameter footprint, interactive LoRA combinations can boost cross-domain performance, and the proposed regularization helps preserve prior knowledge while incorporating new tasks. The combination offers a scalable, interactive framework for continual NMT with practical implications for multilingual and multi-domain translation systems.
Abstract
Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.
