Table of Contents
Fetching ...

Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language

Jesus Alvarez C, Daua D. Karajeanes, Ashley Celeste Prado, John Ruttan, Ivory Yang, Sean O'Brien, Vasu Sharma, Kevin Zhu

TL;DR

This work tackles digital marginalization of endangered languages by introducing Comanche, an endangered Uto-Aztecan language, to NLP. It proposes a minimal-resource workflow combining a manually collected 412-phrase dataset with a GPT-4o-based synthetic data pipeline and few-shot language-identification experiments. Key findings show zero-shot LLMs struggle with Comanche, while few-shot prompting dramatically improves language identification accuracy to 100% with three Comanche examples. The work demonstrates feasibility of community-informed NLP interventions to support language preservation and calls for ethical engagement and broader resource development.

Abstract

The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.

Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language

TL;DR

This work tackles digital marginalization of endangered languages by introducing Comanche, an endangered Uto-Aztecan language, to NLP. It proposes a minimal-resource workflow combining a manually collected 412-phrase dataset with a GPT-4o-based synthetic data pipeline and few-shot language-identification experiments. Key findings show zero-shot LLMs struggle with Comanche, while few-shot prompting dramatically improves language identification accuracy to 100% with three Comanche examples. The work demonstrates feasibility of community-informed NLP interventions to support language preservation and calls for ethical engagement and broader resource development.

Abstract

The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.

Paper Structure

This paper contains 14 sections, 13 figures.

Figures (13)

  • Figure 1: Stylized overview of our exploration of NLP applications for the endangered Comanche language.
  • Figure 2: Family tree for Uto-Aztecan Languages, with Comanche highlighted.
  • Figure 3: Data pipeline.
  • Figure 4: GPT-4o achieves a remarkable improvement in language identification performance, with the help of few-shot examples.
  • Figure 5: Effect of Few-Shot Examples on Comanche Prediction Accuracy.
  • ...and 8 more figures