QueEn: A Large Language Model for Quechua-English Translation

Junhao Chen, Peng Shu, Yiwei Li, Huaqin Zhao, Hanqi Jiang, Yi Pan, Yifan Zhou, Zhengliang Liu, Lewis C Howe, Tianming Liu

TL;DR

QueEn presents a retrieval-augmented framework that combines parameter-efficient fine-tuning (LoRA) with dual retrieval (keyword and embedding) to tackle Quechua–English translation in a low-resource setting. The method leverages external linguistic resources through RAG and adapts the model efficiently, addressing data scarcity while preserving core linguistic knowledge. Empirical results show that GPT+RAG achieves the strongest performance (BLEU ≈ 0.235, ROUGE ≈ 0.278, BERTScore ≈ 0.963), outperforming baselines including LLaMA and GPT without retrieval. The work contributes a scalable approach to endangered-language translation and suggests broader applicability to other low-resource languages, with potential impacts on linguistic preservation and socio-economic inclusion.

Abstract

Recent studies show that large language models (LLMs) are powerful tools for working with natural language, bringing advances in many areas of computational linguistics. However, these models face challenges when applied to low-resource languages due to limited training data and difficulty in understanding cultural nuances. In this paper, we propose QueEn, a novel approach for Quechua-English translation that combines Retrieval-Augmented Generation (RAG) with parameter-efficient fine-tuning techniques. Our method leverages external linguistic resources through RAG and uses Low-Rank Adaptation (LoRA) for efficient model adaptation. Experimental results show that our approach substantially exceeds baseline models, with a BLEU score of 17.6 compared to 1.5 for standard GPT models. The integration of RAG with fine-tuning allows our system to address the challenges of low-resource language translation while maintaining computational efficiency. This work contributes to the broader goal of preserving endangered languages through advanced language technologies.
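For readers unfamiliar with LoRA, the standard formulation can be stated briefly. This is the general technique, not the specific configuration used in this paper, and the symbols below are introduced here for illustration. A pretrained weight matrix $W_0$ is frozen, and only a low-rank update is trained:

$$
W = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

The update adds $r(d + k)$ trainable parameters in place of the $dk$ parameters of the full matrix. As an illustrative case (values assumed, not taken from the paper), $d = k = 4096$ and $r = 8$ give $8 \times 8192 = 65{,}536$ trainable parameters against $4096^2 = 16{,}777{,}216$, roughly 0.39% of the full matrix.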

Paper Structure

This paper contains 16 sections, 10 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Illustration of the retrieval-augmented generation (RAG) architecture. Documents are indexed by both keyword and embedding-vector methods and stored in separate databases. A retrieval agent queries these indexes for relevant information, which a GPT-4 model then uses to generate the response returned to the user.
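
The dual-retrieval flow described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the glossary entries in `DOCS`, the character-trigram stand-in for a learned embedding model, the union-style merge in `dual_retrieve`, and the prompt wording in `build_prompt` are all hypothetical choices made for this example.

```python
import math
from collections import Counter

# Hypothetical toy glossary standing in for the external linguistic resources.
DOCS = [
    "allin punchaw means good day",
    "wasi means house",
    "yaku means water",
    "allin means good",
]

def tokenize(text):
    return text.lower().split()

def keyword_search(query, docs, k=2):
    """Rank documents by how many distinct query tokens they contain."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(d))), i) for i, d in enumerate(docs)]
    scored = [(s, i) for s, i in scored if s > 0]   # drop non-matching documents
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [i for _, i in scored[:k]]

def embed(text):
    """Stand-in embedding: character-trigram counts (assumption, not a learned model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    dot = sum(v * b[t] for t, v in a.items())       # Counter yields 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def embedding_search(query, docs, k=2):
    """Rank documents by cosine similarity to the query vector."""
    qv = embed(query)
    scored = [(cosine(qv, embed(d)), i) for i, d in enumerate(docs)]
    scored = [(s, i) for s, i in scored if s > 0]   # drop documents with no overlap
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [i for _, i in scored[:k]]

def dual_retrieve(query, docs, k=2):
    """Union of both result lists, keyword hits first, duplicates removed."""
    merged = []
    for i in keyword_search(query, docs, k) + embedding_search(query, docs, k):
        if i not in merged:
            merged.append(i)
    return [docs[i] for i in merged]

def build_prompt(query, docs):
    """Assemble the retrieved entries into a prompt for the generator model."""
    context = "\n".join(f"- {d}" for d in dual_retrieve(query, docs))
    return f"Reference entries:\n{context}\n\nTranslate to English: {query}"
```

The two indexes complement each other: an exact query such as `yaku` is found by both, while a misspelled query such as `yakuu` returns nothing from the keyword index but is still matched by the similarity-based one. In a real system the string from `build_prompt` would be sent to the language model; that call is omitted here.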