A Three-Pronged Approach to Cross-Lingual Adaptation with Multilingual LLMs

Vaibhav Singh, Amrith Krishna, Karthika NJ, Ganesh Ramakrishnan

TL;DR

The paper investigates three cross-lingual adaptation strategies—Handholding, Masquerading, and Bridging—for adapting an English-centric Llama-2-7b-chat model to Bengali, Hindi, and Tamil under low-resource constraints. By framing slot filling and NER as text-to-text tasks and evaluating with ICL and PEFT, it demonstrates that Handholding (using English supervision) and Bridging (Hindi continual pre-training) yield the strongest improvements, while Masquerading offers limited benefit, particularly under PEFT. The combination of Handholding and Bridging achieves the best overall performance, highlighting the value of leveraging a predominant language and a related language to enrich multilingual representation. These findings have practical implications for deploying LLMs in underrepresented languages, suggesting targeted pre-training and cross-lingual prompting as effective strategies when resources are scarce.

Abstract

Low-resource languages, by their very definition, tend to be underrepresented in the pre-training corpora of Large Language Models. In this work, we investigate three low-resource cross-lingual approaches that enable an LLM to adapt to tasks in previously unseen languages. Llama-2 is an LLM where Indic languages, among many other language families, contribute less than $0.005\%$ of the total $2$ trillion token pre-training corpus. In this work, we experiment with the English-dominated Llama-2 for cross-lingual transfer to three Indic target languages: Bengali, Hindi, and Tamil. We study three approaches for cross-lingual transfer, under both ICL and fine-tuning. First, we find that adding supervisory signals via a language dominant in the LLM leads to improvements under both in-context learning and fine-tuning. Second, adapting the target languages through word reordering may be beneficial under ICL, but its impact diminishes with fine-tuning. Finally, continued pre-training in one low-resource language can improve model performance for other related low-resource languages.
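
The TL;DR notes that evaluation is done under ICL and PEFT on Llama-2-7b-chat. As a concrete illustration, below is a minimal PEFT sketch using LoRA with the Hugging Face `peft` library; the LoRA hyperparameters and target modules are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal LoRA fine-tuning sketch for meta-llama/Llama-2-7b-chat-hf.
# The hyperparameters (r, alpha, dropout) and target modules below are
# illustrative assumptions, not values reported in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

After wrapping the model, the slot filling and NER data, cast as text-to-text pairs, can be fed to a standard causal-LM training loop (for example, the `transformers` Trainer).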

Paper Structure

This paper contains 28 sections, 4 equations, 4 figures, and 13 tables.

Figures (4)

  • Figure 1: Improved natural language understanding (NLU) and generation (NLG) of Llama-2-7b in Bengali and Tamil through continued pre-training in Hindi (Bridging) and leveraging English for cross-lingual transfer (Handholding).
  • Figure 2: The task of slot filling with an LLM, using the cross-lingual transfer objective from English to Hindi. In this example, the word 'sun' translates to 'sūraja' in Hindi and 'sunday' translates to 'ravivāra'. Thus, in the output, the LLM assigns the label $\underline{\textit{weather\_descriptor}}$ to the Hindi word for 'sun' and the label $\underline{\textit{date}}$ to the Hindi word for 'sunday'. Refer to the prompts train_prompt1 and train_prompt2 in the paper for details on the prompt (a prompt-construction sketch follows this list).
  • Figure 3: English follows subject-verb-object (SVO) word order, in contrast to Hindi, which follows subject-object-verb (SOV) order. As shown, $\mathbf{X^T}$ is presented in SOV order and $\text{re-ordered }\mathbf{X^T}$ is presented in SVO order. $\text{transliterated }\mathbf{X^T}$ is $\mathbf{X^T}$ in Latin script using ISO 15919:2001; here, only the script of $\mathbf{X^T}$ is changed, keeping the word order of Hindi (a toy reordering sketch follows this list).
  • Figure 4: Here, $\text{oracle }\mathbf{Z^S}$ refers to the ground-truth annotation of $\mathbf{X^S}$. $\text{pseudo }\mathbf{Z^S}$ is obtained by passing $\mathbf{X^S}$ through an xlm-roberta-base token classification model (a pseudo-labeling sketch follows this list).
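
Figure 2 casts slot filling as a text-to-text task in which the English sentence and its slot annotation accompany the Hindi sentence to be labeled (Handholding). The sketch below shows one way such a prompt could be assembled; the template wording is an assumption, and the paper's actual prompts (train_prompt1, train_prompt2) differ.

```python
# Sketch of a Handholding-style prompt for cross-lingual slot filling.
# The template text is an assumption; the paper's actual prompts are the
# ones it refers to as train_prompt1 / train_prompt2.
def build_handholding_prompt(x_source: str, z_source: str, x_target: str) -> str:
    """x_source: English sentence, z_source: its slot annotation,
    x_target: the Hindi sentence to be labeled."""
    return (
        "Label each slot in the target sentence, using the English "
        "sentence and its annotation as guidance.\n"
        f"English sentence: {x_source}\n"
        f"English annotation: {z_source}\n"
        f"Target (Hindi) sentence: {x_target}\n"
        "Target annotation:"
    )

prompt = build_handholding_prompt(
    x_source="is the sun out on sunday",
    z_source="is the [sun : weather_descriptor] out on [sunday : date]",
    x_target="kyā ravivāra ko sūraja nikalā hai",  # illustrative romanised Hindi
)
print(prompt)
```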
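
Figure 3 underlies the Masquerading approach: the Hindi (SOV) input is re-ordered into English-like SVO order, while a separate variant only transliterates it to Latin script using ISO 15919:2001, keeping Hindi word order. The toy sketch below illustrates just the re-ordering step on hand-segmented constituents; the paper's actual re-ordering procedure and the transliteration step are not reproduced here.

```python
# Toy sketch of the Masquerading re-ordering step: present a Hindi (SOV)
# sentence in English-like SVO order. Constituents are hand-segmented
# purely for illustration; a real pipeline would derive them automatically.
def sov_to_svo(subject: str, obj: str, verb: str) -> str:
    """Return the constituents in subject-verb-object order."""
    return " ".join([subject, verb, obj])

# Romanised example: "rāma phala khātā hai" ("Rama eats fruit").
subject, obj, verb = "rāma", "phala", "khātā hai"
print("SOV (original Hindi order):", " ".join([subject, obj, verb]))
print("SVO (re-ordered):", sov_to_svo(subject, obj, verb))
```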
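
Figure 4 distinguishes oracle annotations from pseudo annotations produced by an xlm-roberta-base token classification model. A minimal sketch with the Hugging Face `transformers` pipeline follows; the checkpoint name is a hypothetical placeholder, since the paper fine-tunes its own xlm-roberta-base token classifier rather than using an off-the-shelf one.

```python
# Sketch: obtaining pseudo Z^S for an English source sentence X^S with an
# xlm-roberta-base token classification model.
# "xlm-roberta-base-finetuned-slots" is a hypothetical placeholder; the
# paper fine-tunes its own xlm-roberta-base token classifier.
from transformers import pipeline

token_classifier = pipeline(
    "token-classification",
    model="xlm-roberta-base-finetuned-slots",  # placeholder checkpoint name
    aggregation_strategy="simple",             # merge sub-word pieces into spans
)

x_source = "is the sun out on sunday"
pseudo_z_source = token_classifier(x_source)
for span in pseudo_z_source:
    print(span["word"], "->", span["entity_group"])
```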