High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models

Michela Lorandi, Anya Belz

TL;DR

This work investigates data-to-text generation for severely under-resourced languages using out-of-the-box large language models (LLMs). It evaluates four LLMs (GPT-3.5, BLOOM, LLaMa2-chat, Falcon-chat) with three machine translation backends on Irish, Welsh, Breton, Maltese, and English, using WebNLG 2023 data. The key findings are that LLMs can reach state-of-the-art or human-parity performance for the under-resourced languages, that BLEU scores underreport quality for non-task-specific generation, and that generating in English and then machine-translating into the target language often yields stronger automatic metrics than generating in the target language directly. The study also highlights practical considerations such as cost, reproducibility, and prompting choices, and demonstrates the substantial potential of LLMs to bridge language-resource gaps.

Abstract

The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well-resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English, casting doubt on the metric's suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.
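
The abstract's point about BLEU can be made concrete with a small sketch. BLEU is built on clipped n-gram precision against reference texts, so a fluent and accurate output that words the same facts differently from the reference can still score low. The Python below computes only that precision component (no brevity penalty, no geometric mean over n-gram orders), so it is not full BLEU and not the evaluation script used in the paper. The function names and the two example sentences are illustrative assumptions, not taken from the paper's data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference. Tokenization is plain whitespace splitting,
    a simplifying assumption.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return overlap / total


# Illustrative (hypothetical) reference and paraphrase for the same fact.
reference = "alan bean was born in wheeler texas"
paraphrase = "wheeler texas is the birthplace of alan bean"

for n in (1, 2):
    print(n, clipped_precision(paraphrase, reference, n))
```

Both sentences express the same triple, yet only half of the paraphrase's unigrams and two of its seven bigrams appear in the reference. A system that was not trained to imitate the reference style is penalized in exactly this way, which is consistent with the gap between BLEU and human judgments described above.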

Paper Structure

This paper contains 26 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: WebNLG input set of triples and output text.
  • Figure 2: Screenshot of the human evaluation interface.