UniBridge: A Unified Approach to Cross-Lingual Transfer Learning for Low-Resource Languages

Trinh Pham, Khoi M. Le, Luu Anh Tuan

TL;DR

This work addresses cross-lingual transfer learning for low-resource languages by proposing UniBridge, a unified framework that optimizes embeddings and vocabulary for unseen languages. It introduces a vocabulary-size search driven by Average Log Probability (ALP) and a language-specific embedding initialization that combines lexical overlap and semantic alignment, along with a KL-divergence regularizer during adapter-based language adaptation. It further introduces multi-source transfer inference, which computes harmony weights across source languages to ensemble their predictions. Across the WikiANN, UD, and AmericasNLI benchmarks, UniBridge consistently outperforms baselines such as MAD-X and full-model fine-tuning, demonstrating the value of dynamic vocabularies and multi-source knowledge integration. The results highlight practical benefits for robust cross-lingual transfer in low-resource languages.
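
The summary above does not say how the harmony weights are computed, so the following is only a minimal sketch of weighted multi-source ensembling under stated assumptions: each source language is assumed to have a scalar similarity score to the target language, and the weights are assumed to be a softmax over those scores. The function names `harmony_weights` and `ensemble_predictions` and the similarity scores are hypothetical and are not taken from the paper.

```python
import math

def harmony_weights(similarity):
    """Turn per-source similarity scores into weights that sum to 1.

    Assumption: a softmax over similarity scores; the paper's actual
    weighting scheme may differ.
    """
    sources = sorted(similarity)
    top = max(similarity[s] for s in sources)
    exps = [math.exp(similarity[s] - top) for s in sources]  # shift for numerical stability
    total = sum(exps)
    return {s: e / total for s, e in zip(sources, exps)}

def ensemble_predictions(per_source_probs, similarity):
    """Mix class-probability vectors from several source-language models.

    per_source_probs: dict mapping source language -> list of class probabilities.
    similarity: dict mapping source language -> hypothetical similarity score.
    """
    weights = harmony_weights(similarity)
    n_classes = len(next(iter(per_source_probs.values())))
    mixed = [0.0] * n_classes
    for source, probs in per_source_probs.items():
        for c, p in enumerate(probs):
            mixed[c] += weights[source] * p
    return mixed, weights
```

With equal similarity scores the weights are uniform and the result is a plain average of the source predictions; a source with a higher score contributes proportionally more.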

Abstract

In this paper, we introduce UniBridge (Cross-Lingual Transfer Learning with Optimized Embeddings and Vocabulary), a comprehensive approach developed to improve the effectiveness of Cross-Lingual Transfer Learning, particularly in languages with limited resources. Our approach tackles two essential elements of a language model: the initialization of embeddings and the optimal vocabulary size. Specifically, we propose a novel embedding initialization method that leverages both lexical and semantic alignment for a language. In addition, we present a method for systematically searching for the optimal vocabulary size, ensuring a balance between model complexity and linguistic coverage. Our experiments across multilingual datasets show that our approach greatly improves the F1-Score in several languages. UniBridge is a robust and adaptable solution for cross-lingual systems in various languages, highlighting the significance of initializing embeddings and choosing the right vocabulary size in cross-lingual environments.
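
To make the vocabulary-size search more concrete, here is a minimal sketch under stated assumptions. It assumes ALP is the mean, over sentences, of the summed log unigram probabilities of each sentence's subword tokens, and that the search stops once the ALP gain between consecutive candidate sizes falls below a threshold. The stopping rule, the names `average_log_probability` and `search_vocab_size`, and the `tokenize_with` callback are illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter

def average_log_probability(tokenized_sentences):
    """Mean per-sentence sum of log unigram token probabilities.

    Unigram probabilities are estimated from the same tokenized corpus.
    Expects a non-empty list of token lists.
    """
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    total = sum(counts.values())
    log_p = {tok: math.log(c / total) for tok, c in counts.items()}
    sentence_scores = [sum(log_p[t] for t in sent) for sent in tokenized_sentences]
    return sum(sentence_scores) / len(tokenized_sentences)

def search_vocab_size(candidates, tokenize_with, corpus, threshold):
    """Pick the last candidate size whose ALP gain is at least `threshold`.

    candidates: vocabulary sizes in increasing order.
    tokenize_with(size, sentence): hypothetical callback returning the tokens
    produced by a tokenizer trained with that vocabulary size.
    """
    chosen, prev_alp = candidates[0], None
    for size in candidates:
        alp = average_log_probability([tokenize_with(size, s) for s in corpus])
        if prev_alp is not None and alp - prev_alp < threshold:
            break  # growing the vocabulary further no longer helps enough
        chosen, prev_alp = size, alp
    return chosen
```

In practice `tokenize_with` would wrap a subword tokenizer trained at each candidate size; the sketch leaves that step abstract so that it stays self-contained.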

Paper Structure

This paper contains 26 sections, 13 equations, 4 figures, 23 tables, 1 algorithm.

Figures (4)

  • Figure 1: Some languages/scripts are not covered in the pre-training corpora. Hence, the pre-trained tokenizer produces many unknown tokens, which corrupt the sentence's meaning and result in poor performance.
  • Figure 2: Illustration of UniBridge, an end-to-end framework for Cross-Lingual Transfer Learning. The framework encompasses several stages: determining the appropriate vocabulary size, initializing language-specific embeddings, adapting the model to new languages, and transferring task knowledge from multiple source languages. This approach aims to harness the power of a multilingual embedding space rather than relying on a single source language, such as English.
  • Figure 3: Mean F1-Score across various ALP thresholds.
  • Figure 4: Illustrations of subwords exhibiting similarity in both mBERT and XLM-R.