Exploring Intrinsic Language-specific Subspaces in Fine-tuning Multilingual Neural Machine Translation
Zhe Cao, Zhi Qu, Hidetaka Kamigaito, Taro Watanabe
TL;DR
This paper shows that fine-tuning multilingual NMT models can be confined to intrinsic language-specific subspaces, enabling substantial parameter efficiency. It introduces Language-Specific LoRA (LSLo), which models per-language subspaces with sparse routing, and couples it with unstructured pruning and architecture-learning techniques to search for the minimal subspace of each language. Empirical results on FLORES-101 subsets show gains of up to $2.25$ spBLEU over full fine-tuning while using as little as $7\%$ of the trainable parameters for 30 languages; high-resource languages benefit most when their subspaces are reduced. The work also provides guidance on where to place LSLo (primarily in fully connected layers) and how to allocate subspaces across languages, highlighting the benefits of cross-language transfer and resource-aware fine-tuning. Overall, LSLo offers a scalable, efficient approach to fine-tuning hundreds of languages in a single multilingual model, with implications for deployment and resource-constrained settings.
Abstract
Multilingual neural machine translation models support fine-tuning hundreds of languages simultaneously. However, fine-tuning all parameters is inefficient and can lead to negative interactions among languages. In this work, we demonstrate that fine-tuning for a language occurs in its intrinsic language-specific subspace, which involves only a tiny fraction of the model's parameters. We therefore propose language-specific LoRA to isolate these intrinsic language-specific subspaces. Furthermore, we propose architecture learning techniques and introduce a gradual pruning schedule during fine-tuning to exhaustively explore the optimal setting and the minimal intrinsic subspace for each language, resulting in a lightweight yet effective fine-tuning procedure. The experimental results on a 12-language subset and a 30-language subset of FLORES-101 show that our methods not only outperform full-parameter fine-tuning by up to 2.25 spBLEU but also reduce trainable parameters to $0.4\%$ for high- and medium-resource languages and $1.6\%$ for low-resource ones.
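To make the idea of a language-specific LoRA more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single frozen weight matrix shared by all languages plus one low-rank update per language, selected by a language code at forward time. The class name `LanguageSpecificLoRA`, the helper `matvec`, the language codes, and the rank values are all hypothetical and chosen only for illustration; giving the low-resource language a larger rank simply mirrors the abstract's observation that low-resource languages need a larger share of trainable parameters.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LanguageSpecificLoRA:
    """Frozen weight W plus one low-rank update B_l A_l per language.

    Hypothetical sketch: names and initialisation are assumptions,
    not taken from the paper.
    """

    def __init__(self, d_in, d_out, ranks, seed=0):
        rng = random.Random(seed)
        # Frozen pretrained weight, shared by every language.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        self.A, self.B = {}, {}
        for lang, r in ranks.items():
            # Standard LoRA init: A random, B zero, so the update starts at zero.
            self.A[lang] = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
            self.B[lang] = [[0.0] * r for _ in range(d_out)]

    def forward(self, x, lang):
        # Only the adapter of the selected language is applied.
        base = matvec(self.W, x)
        delta = matvec(self.B[lang], matvec(self.A[lang], x))
        return [b + d for b, d in zip(base, delta)]

    def trainable_params(self, lang):
        # A_l has r * d_in entries and B_l has d_out * r entries.
        r = len(self.A[lang])
        return r * (len(self.W[0]) + len(self.W))


# Illustrative ranks: a small subspace for a high-resource language,
# a larger one for a low-resource language (assumed values).
layer = LanguageSpecificLoRA(d_in=8, d_out=8, ranks={"en": 2, "wo": 8})
x = [1.0] * 8
y_en = layer.forward(x, "en")
```

In this sketch the per-language cost is `r * (d_in + d_out)` parameters, so shrinking a language's rank directly shrinks its trainable subspace while the shared weight stays untouched.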
