Table of Contents
Fetching ...

Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach

Salvador Carrión, Francisco Casacuberta

TL;DR

This work tackles continual learning in neural machine translation by adopting Low-Rank Adaptation (LoRA) to enable parameter-efficient fine-tuning across languages and domains, and by introducing a gradient-based regularization scheme targeted at low-rank decomposition matrices to mitigate catastrophic forgetting. It further proposes an interactive, gate-free Mixture of LoRA Experts (MoLE) approach for real-time domain adaptation without retraining. Empirical results show LoRA achieves substantial performance with a small parameter footprint, interactive LoRA combinations can boost cross-domain performance, and the proposed regularization helps preserve prior knowledge while incorporating new tasks. The combination offers a scalable, interactive framework for continual NMT with practical implications for multilingual and multi-domain translation systems.

Abstract

Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.

Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach

TL;DR

This work tackles continual learning in neural machine translation by adopting Low-Rank Adaptation (LoRA) to enable parameter-efficient fine-tuning across languages and domains, and by introducing a gradient-based regularization scheme targeted at low-rank decomposition matrices to mitigate catastrophic forgetting. It further proposes an interactive, gate-free Mixture of LoRA Experts (MoLE) approach for real-time domain adaptation without retraining. Empirical results show LoRA achieves substantial performance with a small parameter footprint, interactive LoRA combinations can boost cross-domain performance, and the proposed regularization helps preserve prior knowledge while incorporating new tasks. The combination offers a scalable, interactive framework for continual NMT with practical implications for multilingual and multi-domain translation systems.

Abstract

Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.

Paper Structure

This paper contains 13 sections, 8 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Theorized model space: If we consider a model as a point in a hyperdimensional space resulting from the data distribution on which it was trained, we can expect that another model trained on a different data distribution, but similar, could end up in the shared region defined by these subspaces. Therefore, to mitigate the catastrophic forgetting phenomenon we need to find a way to move the model into that overlapping region with the available information. (The arrows indicate the potential path for the $f_{A, B}$ model; in green the arrow that passes across the overlapping area, and in black the arrow that skips this area).
  • Figure 2: Comparative analysis of the model performance w.r.t the LoRA-rank and the number of parameters used: This figure shows the performance improvements (BLEU scores) achieved through a low-rank fine-tuning strategy across various domains (i.e. Health, Biological, Legal, and Europarl). As we can see, the LoRA fine-tuning strategy defines a regime of logarithmic performance improvements so that we can use a minimal fraction of parameters to achieve a performance similar to a traditional full-parameter fine-tuning. Note: M = millions.
  • Figure 3: Efficient performance boosting for multilingual NMT models: This figure demonstrates the effectiveness of low-rank adaptations in improving the performance of a multilingual NMT model trained on limited data (25k sentences per language). The BLEU score improvements for each language pair show significant gains with minimal parameters, following a logarithmic trend until the performance matches the full-parameter fine-tuning approach. (M = millions).
  • Figure 4: Incorporating new unseen languages into a pre-trained multilingual NMT model: This figure illustrates the effectiveness of extending a multilingual NMT model to include new language-pairs (en-it and en-pt), completely from scratch, in a parameter-efficient manner, thus, demonstrating consistent performance improvements with an increased parameter utilization. (M = millions).
  • Figure 5: Comparative analysis of training times for low-rank and full-parameter fine-tuning approaches: The low-rank approach is characterized by its longer training times (solid lines) when compared to a full-parameter approach (dashed lines) due to the complexity of capturing complex nonlinear dependencies through a matrix factorization process. In this case, the low-rank approach used 44.26% (rank 256) of the full-parameter model so higher loss and training times are expected.
  • ...and 3 more figures