Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem

Sara Court; Micha Elsner

Abstract

This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of context type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world's 7,000+ languages and their speakers.

Paper Structure

This paper contains 37 sections, 1 figure, and 12 tables.

Introduction
LLMs for Machine Translation
Quechuan Languages
Language-Specific Factors
Morphological Segmentation
Syncretism and Polysemy
Variation
Methods
Data
Prompt Construction
Morpheme Translations (morph)
Grammar Descriptions (grammar)
Parallel Usage Examples (corpus)
Combined Prompt Types
Manually Revised Prompts
...and 22 more sections

Figures (1)

Figure 1: Example baseline prompt. English: [TASK] Translate the following sentence from Quechua to Spanish. Respond only with the translation: Quechua: kay wasiqa turiypam; Spanish:
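
As an illustration of the baseline prompt in Figure 1, the following is a minimal sketch of how such a prompt could be assembled with no retrieved context. The function name `build_baseline_prompt`, its parameters, and the line breaks between fields are assumptions made for this example; the caption gives only the flattened English rendering of the prompt, and this is not the authors' code.

```python
def build_baseline_prompt(sentence: str,
                          source_lang: str = "Quechua",
                          target_lang: str = "Spanish") -> str:
    """Assemble a baseline translation prompt with no retrieved context.

    Hypothetical helper: the wording follows the English rendering in
    Figure 1, but the line layout is an assumption.
    """
    task = (
        f"[TASK] Translate the following sentence from {source_lang} "
        f"to {target_lang}. Respond only with the translation:"
    )
    # The prompt ends with an empty target-language slot for the model to fill.
    return f"{task}\n{source_lang}: {sentence.strip()}\n{target_lang}:"


prompt = build_baseline_prompt("kay wasiqa turiypam")
print(prompt)
```

Under the same assumptions, the context-augmented prompt types listed in the outline (morph, grammar, corpus) would add retrieved material to this template rather than change the task instruction.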
