Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation

Peng Shu, Junhao Chen, Zhengliang Liu, Hui Wang, Zihao Wu, Tianyang Zhong, Yiwei Li, Huaqin Zhao, Hanqi Jiang, Yi Pan, Yifan Zhou, Constance Owl, Xiaoming Zhai, Ninghao Liu, Claudio Saunt, Tianming Liu

TL;DR

This paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms: it translates keywords and retrieves corresponding examples from existing data.

Abstract

Large Language Models (LLMs) have demonstrated remarkable success across a wide range of tasks and domains. However, their performance in low-resource language translation, particularly when translating into these languages, remains underexplored. This gap poses significant challenges, as linguistic barriers hinder the cultural preservation and development of minority communities. To address this issue, this paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms, which involves translating keywords and retrieving corresponding examples from existing data. To evaluate the effectiveness of this method, we conducted experiments translating from English into three low-resource languages: Cherokee, a critically endangered indigenous language of North America; Tibetan, a historically and culturally significant language in Asia; and Manchu, a language with few remaining speakers. Our comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B highlights the significant challenges these models face when translating into low-resource languages. In contrast, our retrieval-based method shows promise in improving both word-level accuracy and overall semantic understanding by leveraging existing resources more effectively.
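The keyword-focused retrieval idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stopword list, the function names (`extract_keywords`, `retrieve`, `build_prompt`), the prompt wording, and the placeholder target-language strings (`TGT_...`) are all assumptions made for illustration. The sketch extracts keywords from an English sentence, looks them up in a toy glossary, pulls parallel examples whose source side contains a matched keyword, and assembles a prompt that could be sent to an LLM.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical stopword list; a real system would likely use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def extract_keywords(sentence: str) -> List[str]:
    """Lowercase, tokenize, drop stopwords, keep first-occurrence order."""
    keywords: List[str] = []
    for token in re.findall(r"[a-z']+", sentence.lower()):
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def retrieve(
    sentence: str,
    glossary: Dict[str, str],
    corpus: List[Tuple[str, str]],
    max_examples: int = 2,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Return glossary hits for the sentence's keywords, plus up to
    max_examples parallel pairs whose source side contains a hit."""
    hits = {k: glossary[k] for k in extract_keywords(sentence) if k in glossary}
    examples: List[Tuple[str, str]] = []
    for src, tgt in corpus:
        src_tokens = set(extract_keywords(src))
        if any(k in src_tokens for k in hits):
            examples.append((src, tgt))
        if len(examples) == max_examples:
            break
    return hits, examples


def build_prompt(
    sentence: str, hits: Dict[str, str], examples: List[Tuple[str, str]]
) -> str:
    """Assemble a plain-text prompt; the wording is an assumption."""
    lines = ["Translate the sentence using the resources below.", "Keywords:"]
    lines += [f"- {en} -> {tgt}" for en, tgt in hits.items()]
    lines.append("Examples:")
    lines += [f"- {src} => {tgt}" for src, tgt in examples]
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)


# Toy data with placeholder target strings (not real translations).
glossary = {"dog": "TGT_dog", "water": "TGT_water"}
corpus = [
    ("The dog runs", "TGT_sent_1"),
    ("A bird sings", "TGT_sent_2"),
    ("Water is cold", "TGT_sent_3"),
]
hits, examples = retrieve("The dog drinks the water", glossary, corpus)
prompt = build_prompt("The dog drinks the water", hits, examples)
```

Exact token matching keeps the sketch short. Figure 1 of the paper describes indexing documents with both keyword and embedding-vector methods, so a fuller version would add an embedding-based retriever alongside the keyword lookup.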

Paper Structure

This paper contains 22 sections, 1 equation, 33 figures, 2 tables.

Figures (33)

  • Figure 1: Illustrating a retrieval-augmented generation (RAG) architecture: Documents are indexed using both keyword and embedding vector methods, stored in separate databases. A retrieval agent accesses these indexes to provide relevant information, which is then processed by a GPT-4 model to deliver responses to users.
  • Figure 2: A test example from the Cherokee New Testament, shown with the ground-truth translation. We test GPT-4o, GPT-4o with RAG, and Llama 405B.
  • Figure 3: An example of testing LLM translation on Peter Parley’s Geography.
  • Figure 4: An example of testing LLM translation on The Pilgrim’s Progress.
  • Figure 5: We applied our retrieval-based model to translate a paragraph from the Kamala Harris debate, presenting the original English text on the left and the corresponding Cherokee translation on the right.
  • ...and 28 more figures