LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons

Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach

TL;DR

LexC-Gen serves as a potential solution to close the performance gap between open-source multilingual models, such as BLOOMZ and Aya-101, and state-of-the-art commercial models like GPT-4o on low-resource-language tasks.

Abstract

Data scarcity in low-resource languages can be addressed with word-to-word translations from labeled task data in high-resource languages using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization. We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then translates that data into low-resource languages with bilingual lexicons via word translation. Across 17 extremely low-resource languages, LexC-Gen-generated data is competitive with expert-translated gold data, and yields on average 5.6 and 8.9 points improvement over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks, respectively. Through an ablation study, we show that conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen serves as a potential solution to close the performance gap between open-source multilingual models, such as BLOOMZ and Aya-101, and state-of-the-art commercial models like GPT-4o on low-resource-language tasks.
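
The word-to-word translation step and the two quantities the abstract mentions (translation coverage and lexicon utilization) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: whitespace tokenization, the function names, and the toy lexicon with placeholder target words (`T_good`, etc.) are hypothetical. Coverage is taken to mean the fraction of data tokens that have a lexicon entry, and utilization the fraction of lexicon entries that appear in the data. The paper's exact definitions and preprocessing may differ.

```python
def translate_word_by_word(sentence, lexicon):
    """Replace every token that has a lexicon entry; keep other tokens unchanged."""
    tokens = sentence.lower().split()  # assumption: simple whitespace tokenization
    return " ".join(lexicon.get(tok, tok) for tok in tokens)


def coverage_and_utilization(sentences, lexicon):
    """Return (word translation coverage, lexicon utilization rate) for a dataset."""
    tokens = [tok for s in sentences for tok in s.lower().split()]
    translated = [tok for tok in tokens if tok in lexicon]
    # Coverage: share of data tokens that the lexicon can translate.
    coverage = len(translated) / len(tokens) if tokens else 0.0
    # Utilization: share of lexicon entries that occur at least once in the data.
    utilization = len(set(translated)) / len(lexicon) if lexicon else 0.0
    return coverage, utilization


# Hypothetical toy lexicon; the target-side strings are placeholders, not real words.
toy_lexicon = {"good": "T_good", "bad": "T_bad", "movie": "T_movie", "food": "T_food"}
toy_data = ["the movie was good", "good food"]

translated_data = [translate_word_by_word(s, toy_lexicon) for s in toy_data]
coverage, utilization = coverage_and_utilization(toy_data, toy_lexicon)
```

In this toy case, four of the six tokens are translatable and three of the four lexicon entries are used. Data generated from lexicon words, as LexC-Gen proposes, is meant to push both ratios up.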

Paper Structure (64 sections, 12 figures, 10 tables)

Figures (12)

  • Figure 1: We observe data-lexicon mismatch (i.e., low lexical overlap) between existing task data and bilingual lexicons. LexC-Gen addresses the issue by generating data using words from lexicons, so that more words in the data are translated (i.e., higher word translation coverage) and the lexicon utilization rate is higher.
  • Figure 2: LexC-Gen. Given a bilingual lexicon and the set of classes for a classification task, (1) we randomly sample a class label and a set of words from the bilingual lexicon, for as many instances as we want to generate. (2) We use these pairs to build the prompts that query the CTG-trained LLM (Figure 3) and generate the task data in the high-resource language. (3) We then train a task classifier on existing task data to filter the generated data and ensure input-label consistency. (4) After filtering, we apply word-to-word translation with the bilingual lexicon, following prior work (Wang et al., 2022). The result is synthetic task data for the target low-resource language, which is used to finetune a task classifier.
  • Figure 3: Controlled-Text Generation (CTG) training. This figure shows the LLM finetuning pipeline for CTG. We construct the training data starting from the existing labeled task data $\mathcal{T}_H$. From each instance $t_H$, we sample without replacement a set of words $W_H$ and associate it with the class $c$. This information is plugged into the prompt template, which is used to finetune an LLM that generates sentences conditioned on $c$ and $W_H$.
  • Figure 4: Ablation study of lexicon-conditioning in LexC-Gen-100K on sentiment analysis. The plot shows the accuracy difference against finetuning with gold translations (green dotted line).
  • Figure 5: Sentiment analysis accuracy (red solid line, left y-axis) and lexicon utilization rate (blue dotted line, right y-axis) against the size of LexC-Gen training task data in log10-scale.
  • ...and 7 more figures
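
Steps (1) to (3) of the Figure 2 caption can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names, the prompt wording, and the `classify` callable are all hypothetical stand-ins. In the paper the prompt is consumed by a CTG-trained LLM and the filter is a task classifier trained on existing task data. The LLM call itself is omitted here.

```python
import random


def sample_conditions(classes, lexicon_words, n_instances, words_per_instance, seed=0):
    """Step 1: sample a class label and a set of lexicon words per instance."""
    rng = random.Random(seed)
    return [
        # rng.sample draws without replacement, so the words in one set are distinct.
        (rng.choice(classes), rng.sample(lexicon_words, words_per_instance))
        for _ in range(n_instances)
    ]


def build_prompt(label, words):
    """Step 2: fill a prompt template (hypothetical wording, not the paper's)."""
    word_list = ", ".join(words)
    return (
        f"Write a {label} sentence that uses as many of these words as possible: "
        f"{word_list}.\nSentence:"
    )


def filter_consistent(generated, classify):
    """Step 3: keep (text, label) pairs whose predicted label matches the requested one."""
    return [(text, label) for text, label in generated if classify(text) == label]


# Toy usage with made-up inputs; `toy_classify` stands in for a trained classifier.
classes = ["positive", "negative"]
lexicon_words = ["good", "bad", "movie", "food", "slow"]
conditions = sample_conditions(classes, lexicon_words, n_instances=3, words_per_instance=2)
prompts = [build_prompt(label, words) for label, words in conditions]


def toy_classify(text):
    return "negative" if "bad" in text else "positive"


generated = [("the food was bad", "negative"), ("the movie was bad", "positive")]
kept = filter_consistent(generated, toy_classify)
```

Only the first toy example survives the filter, because the second was requested as "positive" but is classified as "negative". The surviving examples would then go through word-to-word translation (step 4).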