How I Built ASR for Endangered Languages with a Spoken Dictionary

Christopher Bartley, Anton Ragni

TL;DR

The paper tackles the lack of utterance-level supervision for endangered-language ASR by demonstrating that short-form pronunciation resources, combined with long-form audio and text data, can yield usable transcription performance for Manx and Cornish. It uses a data-efficient pipeline that converts long-form recordings into utterance-level segments via forced alignment, integrates external language models, and compares traditional HMM-based ASR with end-to-end approaches under limited supervision. Key findings show that a modest amount of short-form data can establish a viable baseline and that data diversity and LM integration significantly influence performance: Whisper excels in conversational domains, while HMM-based systems perform robustly on careful and read speech. The work offers a practical path for endangered-language communities to deploy ASR without costly utterance-level corpora, enabling access to oral archives and supporting language revival efforts.

Abstract

Nearly half of the world's languages are endangered. Speech technologies such as Automatic Speech Recognition (ASR) are central to revival efforts, yet most languages remain unsupported because standard pipelines expect utterance-level supervised data. Speech data often exist for endangered languages but rarely match these formats. Manx Gaelic ($\sim$2,200 speakers), for example, has had transcribed speech since 1948, yet remains unsupported by modern systems. In this paper, we explore how little data, and in what form, is needed to build ASR for critically endangered languages. We show that a short-form pronunciation resource is a viable alternative, and that 40 minutes of such data produces usable ASR for Manx ($<$50\% WER). We replicate our approach, applying it to Cornish ($\sim$600 speakers), another critically endangered language. Results show that the barrier to entry, in quantity and form, is far lower than previously thought, giving hope to endangered language communities that cannot afford to meet the requirements arbitrarily imposed upon them.
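The usability threshold above is stated in terms of word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words, expressed as a percentage), the Python below may help. The function name `wer` and the English toy sentences are illustrative assumptions; the scoring setup and any text normalization actually used in the paper are not described in this section.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (substitutions + deletions + insertions) / reference words.

    Assumption: words are separated by whitespace and no text normalization
    (casing, punctuation) is applied; real scoring setups usually normalize first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy example (not data from the paper): one substitution and one
    # deletion against a four-word reference.
    print(f"{wer('a b c d', 'a x c'):.1f}% WER")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, so a figure below 50% means fewer than one word-level error for every two reference words.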

Paper Structure

This paper contains 15 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: %WER performance of models trained on progressively larger amounts of Manx spoken dictionary data and new utterances from specific domains, assessed across careful, read, and conversational speech test sets. Shaded areas denote the gain from external LM integration (top vs. bottom line).