Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Massimo Daul, Alessio Tosolini, Claire Bowern

TL;DR

The paper addresses the scarcity of ASR resources for underresourced languages by testing linguistically informed tokenization on Yan-nhangu. It compares phonemic versus orthographic tokenization using a fine-tuned wav2vec2-BERT 2.0 model with CTC loss, showing that phonemic tokenization reduces WER and CER across data sizes, with meaningful gains emerging around 30 minutes of data and saturating near 90 minutes. In addition, it demonstrates that hand-correcting ASR output is substantially faster than manual transcription, enabling a practical, accelerated language documentation workflow. The findings support the viability of linguistically informed tokenization to improve ASR performance and efficiency in language documentation for underresourced languages.
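For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein distance between reference and hypothesis, divided by the reference length, over words and characters respectively. The function names and the example strings are illustrative assumptions; this section does not say which evaluation tooling the authors used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder strings, not data from the paper.
print(wer("a b c d", "a x c"))  # one substitution + one deletion over 4 words
print(cer("abcd", "abxd"))      # one substitution over 4 characters
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is why they are usually reported as error rates and not as accuracies.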

Abstract

Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
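The contrast between the two tokenization strategies can be sketched as follows. This is a simplified illustration, assuming that the orthographic baseline splits text into single characters and that the phonemic scheme keeps each multi-character grapheme as one token; the multigraph inventory and the example string are hypothetical placeholders, not the authors' actual Yan-nhangu token set.

```python
# Hypothetical multigraph inventory, for illustration only.
# The real inventory would come from the language's phonology and orthography.
MULTIGRAPHS = ("ng", "ny", "rr")


def orthographic_tokens(word):
    """Baseline assumption: every written character is its own token."""
    return list(word)


def phonemic_tokens(word, multigraphs=MULTIGRAPHS):
    """Greedy longest-match split that keeps each multigraph as one token."""
    ordered = sorted(multigraphs, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for m in ordered:
            if word.startswith(m, i):
                tokens.append(m)
                i += len(m)
                break
        else:
            # No multigraph starts here, so emit a single character.
            tokens.append(word[i])
            i += 1
    return tokens


# Placeholder string, not an attested word.
example = "nganyarra"
print(orthographic_tokens(example))
print(phonemic_tokens(example))
```

Under this assumption the phonemic scheme yields fewer, more acoustically coherent output units for a CTC model to predict, since one sound is no longer spread across two output symbols.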

Paper Structure

This paper contains 11 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Waveform, spectrogram, and annotations for a sample test point.
  • Figure 2: Model comparisons by WER, CER, and loss.
  • Figure 3: Counts for deletions, insertions, and substitutions for the best ASR models using phonemic and orthographic tokenization.