A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation
Francois Meyer, Jan Buys
TL;DR
This paper investigates how subword segmentation affects multilingual MT and cross-lingual transfer, comparing five subword methods on translation from English to Siswati and related South African languages. It runs two experimental tracks, multilingual MT and cross-lingual fine-tuning, across diverse linguistic typologies. The results indicate that subword regularisation with the unigram language model (ULM) enhances synergy in multilingual training, whereas BPE better facilitates transfer during fine-tuning. The study also suggests that differences in orthographic word boundary conventions can impede cross-lingual transfer more than linguistic unrelatedness does, which makes orthography a key factor in multilingual modelling. For practitioners, the findings point to tailoring the subword strategy to the setting: ULM for multilingual synergy and BPE for cross-lingual transfer, with close attention to orthographic differences between languages.
Abstract
Multilingual modelling can improve machine translation for low-resource languages, partly through shared subword representations. This paper studies the role of subword segmentation in cross-lingual transfer. We systematically compare the efficacy of several subword methods in promoting synergy and preventing interference across different linguistic typologies. Our findings show that subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning. Notably, our results suggest that differences in orthographic word boundary conventions (the morphological granularity of written words) may impede cross-lingual transfer more significantly than linguistic unrelatedness. Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.
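To make the contrast between deterministic segmentation and subword regularisation concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a hypothetical toy unigram vocabulary with made-up probabilities, and it enumerates segmentations by brute force, which is only feasible for short words. The names `VOCAB`, `segmentations`, `best_segmentation`, `sample_segmentation`, and the smoothing parameter `alpha` are illustrative assumptions.

```python
import math
import random

# Hypothetical toy unigram vocabulary; the probabilities are invented
# for illustration and are not taken from the paper.
VOCAB = {
    "un": 0.10, "related": 0.05, "relat": 0.04, "ed": 0.12,
    "u": 0.02, "n": 0.02, "r": 0.02, "e": 0.03, "l": 0.02,
    "a": 0.03, "t": 0.02, "d": 0.02,
}


def segmentations(word):
    """Yield every split of `word` into pieces that all occur in VOCAB."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in VOCAB:
            for rest in segmentations(word[end:]):
                yield [piece] + rest


def score(seg):
    """Log-probability of a segmentation under the unigram model."""
    return sum(math.log(VOCAB[p]) for p in seg)


def best_segmentation(word):
    """Deterministic choice: the single most probable segmentation."""
    return max(segmentations(word), key=score)


def sample_segmentation(word, alpha=0.5, rng=random):
    """Subword regularisation: sample a segmentation with probability
    proportional to P(segmentation) ** alpha (smaller alpha = more varied)."""
    segs = list(segmentations(word))
    weights = [math.exp(alpha * score(s)) for s in segs]
    return rng.choices(segs, weights=weights, k=1)[0]


if __name__ == "__main__":
    print("best:", best_segmentation("unrelated"))
    for _ in range(3):
        print("sampled:", sample_segmentation("unrelated"))
```

With a deterministic segmenter, the model sees the same split of a word every time it appears in training. With sampling, it sees several plausible splits of the same word, which is the mechanism that subword regularisation relies on.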
