UniBridge: A Unified Approach to Cross-Lingual Transfer Learning for Low-Resource Languages
Trinh Pham, Khoi M. Le, Luu Anh Tuan
TL;DR
This work addresses cross-lingual transfer learning for low-resource languages by proposing UniBridge, a unified framework that optimizes embeddings and vocabulary for unseen languages. It introduces a vocabulary size search driven by Average Log Probability and a language-specific embedding initialization that combines lexical overlap and semantic alignment, along with a KL divergence regularizer applied during adapter-based language adaptation. It further introduces a multi-source transfer inference procedure that computes harmony weights across source languages and uses them to ensemble the sources' predictions. Across the WikiANN, UD, and AmericasNLI benchmarks, UniBridge consistently outperforms baselines such as MAD-X and full-model fine-tuning, demonstrating the value of dynamic vocabularies and multi-source knowledge integration. The results highlight practical benefits for robust cross-lingual transfer to low-resource languages.
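As a rough illustration of the multi-source inference idea, the sketch below combines per-source class distributions using normalized weights. It is only a minimal sketch under stated assumptions: the function names (`softmax`, `ensemble_predictions`, `predict`), the use of a softmax to normalize the weights, and the per-source scores passed in are all hypothetical. This summary does not specify how UniBridge actually computes its harmony weights.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predictions(source_probs, source_scores):
    """Weighted average of class distributions from several source languages.

    source_probs:  one probability vector per source language.
    source_scores: one real-valued score per source language
                   (ASSUMPTION: a stand-in for the harmony weights).
    """
    if len(source_probs) != len(source_scores):
        raise ValueError("need exactly one score per source language")
    weights = softmax(source_scores)
    combined = [0.0] * len(source_probs[0])
    for weight, probs in zip(weights, source_probs):
        for i, p in enumerate(probs):
            combined[i] += weight * p
    return combined


def predict(source_probs, source_scores):
    """Index of the class with the highest ensembled probability."""
    combined = ensemble_predictions(source_probs, source_scores)
    return max(range(len(combined)), key=combined.__getitem__)
```

With equal scores every source contributes equally. Raising one source's score shifts the ensembled prediction toward that source, which is the behavior a per-source weighting scheme is meant to provide.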
Abstract
In this paper, we introduce UniBridge (Cross-Lingual Transfer Learning with Optimized Embeddings and Vocabulary), a comprehensive approach developed to improve the effectiveness of cross-lingual transfer learning, particularly for languages with limited resources. Our approach tackles two essential elements of a language model: the initialization of embeddings and the choice of vocabulary size. Specifically, we propose a novel embedding initialization method that leverages both lexical and semantic alignment for a language. In addition, we present a method for systematically searching for the optimal vocabulary size, balancing model complexity against linguistic coverage. Our experiments across multilingual datasets show that our approach greatly improves the F1 score in several languages. UniBridge is a robust and adaptable solution for cross-lingual systems across languages, highlighting the importance of embedding initialization and vocabulary size selection in cross-lingual settings.
