Exploring Intrinsic Language-specific Subspaces in Fine-tuning Multilingual Neural Machine Translation

Zhe Cao; Zhi Qu; Hidetaka Kamigaito; Taro Watanabe

TL;DR

This paper shows that fine-tuning multilingual NMT models can be confined to intrinsic language-specific subspaces, enabling substantial parameter efficiency. It introduces Language-Specific LoRA (LSLo), which models per-language subspaces with sparse routing, and combines it with unstructured pruning and architecture-learning techniques to search for the minimal subspace of each language. Empirical results on FLORES-101 subsets show gains of up to $2.25$ spBLEU over full fine-tuning while training as little as $7\%$ of the parameters for 30 languages; high-resource languages benefit in particular when their subspaces are reduced. The work also provides guidance on where to place LSLo (primarily in fully connected layers) and how to allocate subspaces across languages, highlighting the benefits of cross-language transfer and resource-aware fine-tuning. Overall, LSLo offers a scalable, efficient approach for fine-tuning hundreds of languages in a single multilingual model, with implications for deployment and resource-constrained settings.
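
To make the idea of per-language low-rank subspaces with sparse routing concrete, here is a minimal NumPy sketch. It assumes the standard LoRA form $h = Wx + B_l A_l x$ with one pair $(A_l, B_l)$ per language $l$, selected by a language tag. The class name `LanguageSpecificLoRA`, its methods, the rank, and the dimensions are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


class LanguageSpecificLoRA:
    """Frozen weight W plus one low-rank pair (A_l, B_l) per language.

    Illustrative sketch only: names, shapes, and initialization are assumptions.
    """

    def __init__(self, d_in, d_out, languages, rank=2, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a frozen pre-trained weight matrix.
        self.W = rng.normal(size=(d_out, d_in))
        # One adapter per language. B starts at zero, so every adapter is
        # initially a no-op, as in standard LoRA.
        self.A = {l: rng.normal(scale=0.01, size=(rank, d_in)) for l in languages}
        self.B = {l: np.zeros((d_out, rank)) for l in languages}

    def forward(self, x, lang):
        # Sparse routing: only the adapter belonging to `lang` is applied.
        return x @ self.W.T + (x @ self.A[lang].T) @ self.B[lang].T

    def trainable_fraction(self, lang):
        # Adapter parameters for one language relative to the frozen matrix.
        adapter = self.A[lang].size + self.B[lang].size
        return adapter / self.W.size


layer = LanguageSpecificLoRA(d_in=8, d_out=8, languages=["en", "ja"], rank=2)
x = np.ones((3, 8))
y = layer.forward(x, "ja")  # uses only the "ja" adapter
```

Because each language only touches its own pair of matrices, updating the adapter for one language leaves the outputs for every other language unchanged, which is the isolation property the summary above describes.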

Abstract

Multilingual neural machine translation models support fine-tuning hundreds of languages simultaneously. However, fine-tuning all parameters is inefficient and can lead to negative interactions among languages. In this work, we demonstrate that fine-tuning for a language occurs in its intrinsic language-specific subspace, which involves only a tiny fraction of the full parameters. Thus, we propose language-specific LoRA to isolate intrinsic language-specific subspaces. Furthermore, we propose architecture learning techniques and introduce a gradual pruning schedule during fine-tuning to exhaustively explore the optimal setting and the minimal intrinsic subspaces for each language, resulting in a lightweight yet effective fine-tuning procedure. Experimental results on a 12-language subset and a 30-language subset of FLORES-101 show that our methods not only outperform full-parameter fine-tuning by up to 2.25 spBLEU but also reduce trainable parameters to $0.4\%$ for high- and medium-resource languages and $1.6\%$ for low-resource ones.
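
The abstract mentions a gradual pruning schedule but does not give its form here. As a hedged illustration, the sketch below uses the widely known cubic sparsity schedule of Zhu and Gupta (2017) together with magnitude-based unstructured pruning. Both are assumptions swapped in for illustration and may differ from the schedule and criterion the authors use; the function names `cubic_sparsity` and `magnitude_mask` are hypothetical.

```python
import numpy as np


def cubic_sparsity(step, start, end, final_sparsity, initial_sparsity=0.0):
    """Target sparsity at `step`, ramping from `start` to `end` on a cubic curve."""
    if step <= start:
        return initial_sparsity
    if step >= end:
        return final_sparsity
    progress = (step - start) / (end - start)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Boolean mask that drops the smallest-magnitude entries (unstructured)."""
    n_prune = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=bool)
    if n_prune > 0:
        # Flattened indices sorted by absolute value, smallest first.
        order = np.argsort(np.abs(weights), axis=None)
        mask[order[:n_prune]] = False
    return mask.reshape(weights.shape)


# Example: prune a small adapter matrix halfway through the schedule.
adapter = np.array([[1.0, -4.0], [0.5, 3.0]])
target = cubic_sparsity(step=5, start=0, end=10, final_sparsity=0.8)
pruned = adapter * magnitude_mask(adapter, target)
```

Under this assumed schedule, sparsity rises quickly at first and levels off near the end, so most entries are removed while the adapter can still recover during the remaining fine-tuning steps.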

Paper Structure (30 sections, 6 equations, 6 figures, 10 tables)

Introduction
Background
Methodology
Language-specific LoRA
Unstructured Pruning
Architecture Learning
Weight Learning
Intrinsic Subspace Estimation
Experimental Setup
Dataset
Model Setting
Training
Evaluation
Results
Weight Learning
...and 15 more sections

Figures (6)

Figure 1: Source (src) and target (tgt) weights learned across layers in encoder (enc) and decoder (dec). The model's focus shifted from the source side to the target side near the top of the encoder.
Figure 2: Illustration of the parameter space demands for each language, averaged across all layers. Color indicates the demands from low (blue) to high (red). Rows are organized by language resource type: high-resource (green), medium-resource (blue), and very-low-resource (red). Columns are organized by weight matrices in the encoder and decoder: query, key, and value matrices of attention (q, k, v) and cross-attention (c-q, c-k, c-v); down and up matrices of MLP (fc1, fc2).
Figure 3: We examined the performance of H2H and V2V directions per epoch. H2H performance declined during training.
Figure 4: Illustration of the parameter space demands for each weight matrix, averaged across all languages. Color indicates the demands from low (blue) to high (red). Columns are organized by weight matrices in the encoder and decoder: query, key, and value matrices of attention (q, k, v) and cross-attention (c-q, c-k, c-v); down and up matrices of MLP (fc1, fc2).
Figure 5: The parameter space demands for each language in each of the 12 layers of the encoder and decoder. Red indicates a higher demand. There is a clear tendency for very-low-resource languages to require more parameters during fine-tuning.
...and 1 more figures
