Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges

Vincent Koc

TL;DR

Generative AI and LLMs offer transformative potential for preserving endangered languages, but their deployment raises risks around data sovereignty, bias, and cultural misrepresentation. The paper proposes an analytical framework that aligns GenAI capabilities with language communities' needs, embedding ethical safeguards and governance. Te Reo Māori serves as a detailed worked example, illustrating high-accuracy ASR outcomes and actionable strategies for community-led development. The paper also introduces the ImpactScore rubric to guide responsible intervention prioritization, and discusses future research directions to advance low-resource learning, explainability, and culturally resonant metrics.

Abstract

The global crisis of language endangerment meets a technological turning point as Generative AI (GenAI) and Large Language Models (LLMs) unlock new frontiers in automating corpus creation, transcription, translation, and tutoring. However, this promise is imperiled by fragmented practices and the critical lack of a methodology to navigate the fraught balance between LLM capabilities and the profound risks of data scarcity, cultural misappropriation, and ethical missteps. This paper introduces a novel analytical framework that systematically evaluates GenAI applications against language-specific needs, embedding community governance and ethical safeguards as foundational pillars. We demonstrate its efficacy through a case study of Te Reo Māori revitalization, where it illuminates successes, such as community-led Automatic Speech Recognition achieving 92% accuracy, while critically surfacing persistent challenges in data sovereignty and model bias for digital archives and educational tools. Our findings underscore that GenAI can indeed revolutionize language preservation, but only when interventions are rigorously anchored in community-centric data stewardship, continuous evaluation, and transparent risk management. Ultimately, this framework provides an indispensable toolkit for researchers, language communities, and policymakers, aiming to catalyze the ethical and high-impact deployment of LLMs to safeguard the world's linguistic heritage.

Paper Structure

This paper contains 25 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Taxonomy of Opportunities and Challenges in Applying Generative AI to Language Preservation.
  • Figure 2: The proposed analytical framework detailing inputs, core processes (numbered 1 to 3 to visually echo the text), a functional mapping summary, and outputs for assessing GenAI applications in language preservation.
  • Figure 3: A human-centered framework for GenAI initiatives, illustrating the dual phases of problem identification and solution implementation, with iterative feedback between readiness, strategy, use case discovery, operating model, infrastructure, and awareness. Adapted and redesigned from PwC [pwc2025].