LLM-powered Data Augmentation for Enhanced Cross-lingual Performance

Chenxi Whitehouse, Monojit Choudhury, Alham Fikri Aji

TL;DR

This work addresses data scarcity in multilingual commonsense reasoning by using Large Language Models to synthesize training data for XCOPA, XWinograd, and XStoryCloze, and then fine-tuning small multilingual models (mBERT and XLM-R) on it. It compares data generated in English, data generated directly in the target languages, and translations of the English-generated data, showing that LLM-generated data can substantially boost cross-lingual performance (up to 13.4 accuracy points in the best case) and that GPT-4 generally provides the most robust improvements. A human evaluation by native speakers finds that ChatGPT and GPT-4 produce natural text across many languages, though Tamil remains challenging, and that GPT-4 tends to produce more coherent logic than ChatGPT. The study releases the synthesized datasets and highlights practical considerations such as cost and language coverage, as well as the potential of open-source instruction-tuned LLMs as future work.

Abstract

This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLM-R, using the synthesised data. We compare the performance of training with data generated in English and target languages, as well as translated English-generated data, revealing the overall advantages of incorporating data generated by LLMs, e.g. a notable 13.4 accuracy score improvement for the best case. Furthermore, we conduct a human evaluation by asking native speakers to assess the naturalness and logical coherence of the generated examples across different languages. The results of the evaluation indicate that LLMs such as ChatGPT and GPT-4 excel at producing natural and coherent text in most languages; however, they struggle to generate meaningful text in certain languages like Tamil. We also observe that ChatGPT falls short in generating plausible alternatives compared to the original dataset, whereas examples from GPT-4 exhibit competitive logical consistency.

Paper Structure (21 sections, 2 figures, 12 tables)

Figures (2)

  • Figure 1: Human evaluation of 50 random examples from the original XCOPA, ChatGPT (top) and GPT-4 (bottom) generated data in target languages, and translation of English generated data. Examples are annotated by two native speakers in each language. The subplots in the last column show the logic issues of the XCOPA data, where the three bars for each language represent Original, $Gen_{XX}$, and $Gen_{EN}^{Trans}$ (from left to right).
  • Figure 2: Comparison between the 30 most frequent events and the lengths of the sentences in the original and the ChatGPT-generated English StoryCloze dataset.