Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study

Rakesh Paul, Anusha Kamath, Kanishk Singla, Raviraj Joshi, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar

TL;DR

This work tackles the persistent gap in LLM performance between English and low-resource languages by proposing LLM-based selective translation to generate high-quality Hindi alignment data that preserves non-translatable content. It implements a two-stage alignment (SFT followed by DPO) using mixed English+Hindi data, guided by FAITH-based quality filtering and a safety-aware data pipeline. Through systematic experiments on Hindi with comparisons to vanilla translation, the study shows that selective translation yields superior multilingual alignment, with even small amounts of Hindi data producing meaningful gains and data quality driving efficiency. The results suggest that mixing languages and rigorous data curation can substantially improve Hindi LLM capabilities, with broad implications for inclusive AI in low-resource settings.

Abstract

Multilingual large language models (LLMs) often demonstrate a performance gap between English and non-English languages, particularly in low-resource settings. Aligning these models to low-resource languages is essential yet challenging due to limited high-quality data. While English alignment datasets are readily available, curating equivalent data in other languages is expensive and time-consuming. A common workaround is to translate existing English alignment data; however, standard translation techniques often fail to preserve critical elements such as code, mathematical expressions, and structured formats like JSON. In this work, we investigate LLM-based selective translation, a technique that selectively translates only the translatable parts of a text while preserving non-translatable content and sentence structure. We conduct a systematic study to explore key questions around this approach, including its effectiveness compared to vanilla translation, the importance of filtering noisy outputs, and the benefits of mixing translated samples with original English data during alignment. Our experiments focus on the low-resource Indic language Hindi and compare translations generated by Google Cloud Translation (GCP) and Llama-3.1-405B. The results highlight the promise of selective translation as a practical and effective method for improving multilingual alignment in LLMs.
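
To make the idea of selective translation concrete, here is a minimal sketch in Python. It is not the authors' method: the paper performs selective translation by prompting an LLM (Llama-3.1-405B), whereas this sketch swaps in a simple rule-based masking step that shields fenced code, inline code, and `$...$` math from a translator. The `translate` callable, the `PROTECTED` pattern, and the function name are hypothetical and exist only for illustration.

```python
import re
from typing import Callable

# Assumption: "non-translatable" spans are fenced code blocks, inline code,
# and $...$ math. The paper's LLM-based approach also handles other cases
# (e.g. JSON); this pattern is illustrative only.
PROTECTED = re.compile(r"(```.*?```|`[^`\n]+`|\$[^$\n]+\$)", re.DOTALL)


def selective_translate(text: str, translate: Callable[[str], str]) -> str:
    """Translate only the prose segments of `text`, leaving protected spans intact.

    `translate` is a hypothetical stand-in for any translation backend.
    """
    # With a single capturing group, re.split returns prose segments at even
    # indices and the matched (protected) spans at odd indices.
    parts = PROTECTED.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1 or not part.strip():
            out.append(part)  # protected span or whitespace: copy verbatim
        else:
            out.append(translate(part))  # prose: send to the translator
    return "".join(out)
```

Passing `str.upper` as a toy "translator" shows the behavior: the prose is transformed while the code and math spans pass through unchanged. A rule-based mask like this cannot preserve sentence structure around the protected spans the way an instruction-following LLM can, which is part of the motivation for the LLM-based approach studied in the paper.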

Paper Structure

This paper contains 12 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: English to Hindi translation examples using LLM-based selective translation and vanilla GCP translation.
  • Figure 2: Overall training pipeline comprising translation, filtering, SFT, and DPO stages.
  • Figure 3: Hybrid approach for selective translation-based data curation pipeline with safety considerations. The unsafe queries contain harmful, biased, or inappropriate content that LLMs typically decline to translate.
  • Figure 4: A/B comparison of translation quality, judged by Llama-3.1-Nemotron-70B-Instruct. The graph illustrates the percentage preference for LLM, GCP, both, or neither across various SFT dataset categories.
  • Figure 5: Percentage of LLM- and GCP-translated SFT data filtered out by the Llama-3.1-Nemotron-70B-Instruct judge model, i.e., samples that did not achieve full scores.
  • ...and 5 more figures