LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages

Jared Coleman, Bhaskar Krishnamachari, Khalil Iskarous, Ruben Rosales

TL;DR

The paper addresses machine translation for no-resource languages by proposing LLM-Assisted Rule-Based Machine Translation (LLM-RBMT), a paradigm that enables translation without parallel corpora. It applies the paradigm to Owens Valley Paiute (OVP), combining a rule-based sentence builder with LLM-driven translators in both directions (OVP→English and English→OVP), with a focus on language teaching and revitalization. Because no bilingual data is available, evaluation relies on semantic similarity metrics; results are strong for translations within the constrained vocabulary and reveal limitations when the vocabulary is incomplete. The work contributes a practical, extensible toolchain for endangered-language revitalization and a framework for extending LLM-RBMT to other no-resource languages.

Abstract

We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation). Using the LLM-RBMT paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.

Paper Structure (11 sections, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Few-shot examples for translating "Wo'ada-ii pagwi-noka u-zawa-dü." using gpt-3.5-turbo.
  • Figure 2: Few-shot training examples for English to OVP translation using gpt-3.5-turbo.
  • Figure 3: The entire English to OVP translation process. The box with a red, dashed border indicates the set of sentences in Owens Valley Paiute (the target language) and the box with a blue, dashed border indicates the set of English sentences they translate to. Ideally, the input sentence, simple sentences, and English output sentences will have equivalent or very similar semantic meaning.
  • Figure 4: Results for subject-verb sentences. The dark, medium, and light gray bands represent the baseline similarity (between unrelated sentences in the dataset) +/- one, two, and three standard deviations, respectively.
  • Figure 7: Results for subject-verb sentences. The dark, medium, and light gray bands represent the baseline similarity (between unrelated sentences in the dataset) +/- one, two, and three standard deviations, respectively.
  • ...and 5 more figures
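
The Figure 4 caption describes gray bands at the baseline similarity (between unrelated sentences) plus or minus one, two, and three standard deviations. The following is a minimal sketch of how such bands could be computed, not the authors' implementation. It assumes sentences have already been mapped to embedding vectors and that similarity is cosine similarity; the function names, the toy vectors, and the choice of population standard deviation are all illustrative assumptions.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def baseline_bands(unrelated_pairs, max_k=3):
    """Return (mean, std, bands) for similarities of unrelated sentence pairs.

    bands maps k to the interval (mean - k*std, mean + k*std).
    Assumption: population standard deviation; the source does not say which
    estimator is used.
    """
    sims = [cosine_similarity(a, b) for a, b in unrelated_pairs]
    mean = statistics.mean(sims)
    std = statistics.pstdev(sims)
    bands = {k: (mean - k * std, mean + k * std) for k in range(1, max_k + 1)}
    return mean, std, bands


def band_of(similarity, mean, std, max_k=3):
    """Smallest k with |similarity - mean| <= k*std, or None if outside all bands."""
    for k in range(1, max_k + 1):
        if abs(similarity - mean) <= k * std:
            return k
    return None


# Hypothetical 2-D "embeddings" of unrelated sentence pairs (made-up values).
toy_pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [1.0, 0.0]),
]
mean, std, bands = baseline_bands(toy_pairs)
```

Under this reading, a translation whose similarity to the source sentence lies outside all three bands (`band_of` returns `None`) is well separated from the unrelated-sentence baseline.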