NoLoR: An ASR-Based Framework for Expedited Endangered Language Documentation with Neo-Aramaic as a Case Study

Matthew Nazari

TL;DR

This work tackles the urgent task of documenting endangered Neo-Aramaic dialects by introducing NoLoR, an ASR-based framework designed to expedite transcription and data collection. The approach defines a phonemic orthography, builds an initial dataset, fine-tunes a wav2vec 2.0 model, and iteratively expands the dataset through ongoing data collection and crowdsourcing. Applied to the Urmi dialect (C. Urmi), NoLoR yields a character error rate (CER) of about 12.5% and can accelerate transcription by up to 6.3x, while relying on relatively small, carefully prepared datasets. The contributions include a publicly available 35-minute dataset, an ASR model, and the AssyrianVoices crowdsourcing platform, demonstrating a practical pathway to scaling endangered-language documentation and increasing community involvement.
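For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between a reference transcription and the model's output, divided by the reference length; a CER of 12.5% therefore corresponds to roughly one character edit per eight reference characters. The sketch below illustrates that standard definition only. The function names `edit_distance` and `cer` are hypothetical, the strings are placeholder English words rather than Neo-Aramaic data, and the paper's own evaluation code may normalize text differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Placeholder strings, not data from the paper.
    print(cer("kitten", "sitting"))  # 3 edits / 6 reference characters = 0.5
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis is much longer than the reference.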

Abstract

The documentation of the Neo-Aramaic dialects before their extinction has been described as the most urgent task in all of Semitology today. The death of this language will be an unfathomable loss to the descendants of the indigenous speakers of Aramaic, now predominantly diasporic after forced displacement due to violence. This paper develops an ASR model to expedite the documentation of this endangered language and generalizes the strategy in a new framework we call NoLoR.

Paper Structure

This paper contains 26 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Area where North-Eastern Neo-Aramaic dialects are spoken by indigenous communities (in red). The city of Urmi is the center for one of the most common dialects.
  • Figure 2: The NoLoR framework describes a positive feedback loop following two preliminary stages. In this loop, language documentation teams collect more data and their transcription efficiency increases.
  • Figure 3: Example of refining original transcriptions for machine learning tasks.
  • Figure 4: Data from the language documentation effort will not be formatted for machine learning tasks and must be processed accordingly.
  • Figure 5: Data Augmentation
  • ...and 1 more figures