Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation

Vivek Iyer, Bhavitvya Malik, Pavel Stepachev, Pinzhen Chen, Barry Haddow, Alexandra Birch

TL;DR

For low-resource LLM-MT, the paper shows that two recent trends do not hold: a) parallel data, despite its reduced use, is critical during both pre-training and SFT; b) diversity in SFT data tends to cause interference instead of transfer.

Abstract

Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource languages (LRLs) still lags significantly behind Neural Machine Translation (NMT) models. In this work, we explore what it would take to adapt LLMs for the low-resource setting. Particularly, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has seen reduced use in adapting LLMs for MT, while data diversity has been embraced to promote transfer across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both considerations: a) parallel data is critical during both pre-training and SFT; b) diversity tends to cause interference instead of transfer. Our experiments with three LLMs across two low-resourced language groups -- Indigenous American and North-East Indian -- reveal consistent trends, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve LRLs.

Paper Structure (41 sections, 7 figures, 14 tables)

Figures (7)

  • Figure 1: Strategies explored for incorporating parallel data during Continued Pre-Training. We show a Spanish (es) to Aymara (aym) example from our parallel data.
  • Figure 2: Comparing Llama3 8B models pre-trained on monolingual data alone versus those that also included parallel data (concatenated, or as separate texts) at various scales. Models were pre-trained on 1M, 3M, 5M, 8M, and 13M sentences, and markers denote the corresponding token counts. The y-axis shows chrF++ after SFT on 500K spa-X MT data for 1 epoch.
  • Figure 3: Scaling up Llama3 8B models with different CPT recipes (no CPT, CPT with monolingual data, and CPT with a mixture of monolingual and parallel data) on MT data for the American languages. $^\dag$'High-resource' refers to the relatively higher-resourced languages in our low-resource setup (Aymara, Guarani and Quechua) while the other 8 are grouped as low-resource$^\eta$.
  • Figure 4: Scaling up Llama3 8B models with the three CPT recipes for the 4 Indic languages, up to 5M sentences. Budget constraints forced us to stop training 'No CPT' at 2.5M sentences.
  • Figure 5: Epoch vs. performance for low-resource LLM-MT. We use the entire 1M spa-X MT dataset and plot average chrF++ for the Indigenous American languages, using the Llama3 (Mono+Parallel (concat)) model.
  • ...and 2 more figures
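
The captions for Figures 1 and 2 distinguish two ways of adding parallel data to the Continued Pre-Training (CPT) mix: concatenating each source and target sentence into one text, or adding them as separate texts. The sketch below is only an illustration of that distinction. The function name `build_cpt_texts`, the strategy labels, and the newline separator are assumptions made here, not the authors' actual data format, and the example sentences are placeholders rather than real Spanish or Aymara data.

```python
from typing import List, Tuple

# Hypothetical strategy labels; the paper's own naming may differ.
STRATEGIES = ("mono_only", "concat", "separate")


def build_cpt_texts(
    mono: List[str],
    parallel: List[Tuple[str, str]],
    strategy: str,
    sep: str = "\n",  # assumed separator between source and target
) -> List[str]:
    """Assemble the list of training texts for continued pre-training.

    mono_only: monolingual sentences only.
    concat:    each (source, target) pair becomes ONE text, source then target.
    separate:  source and target are added as two independent texts.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    texts = list(mono)  # copy so the caller's list is not mutated
    if strategy == "mono_only":
        return texts

    for src, tgt in parallel:
        if strategy == "concat":
            texts.append(f"{src}{sep}{tgt}")
        else:  # "separate"
            texts.extend([src, tgt])
    return texts


if __name__ == "__main__":
    # Placeholder strings stand in for real monolingual and es-aym data.
    mono = ["<monolingual sentence>"]
    pairs = [("<Spanish sentence>", "<Aymara sentence>")]
    for name in STRATEGIES:
        print(name, build_cpt_texts(mono, pairs, name))
```

In the concatenated variant the model sees source and target in the same context window, whereas in the separate variant the two sides are only related through the shared training mix.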