NoLoR: An ASR-Based Framework for Expedited Endangered Language Documentation with Neo-Aramaic as a Case Study
Matthew Nazari
TL;DR
This work tackles the urgent task of documenting endangered Neo-Aramaic dialects by introducing NoLoR, an ASR-based framework designed to expedite transcription and data collection. The approach defines a phonemic orthography, builds an initial dataset, fine-tunes a wav2vec 2.0 model, and iteratively expands the dataset through ongoing data collection and crowdsourcing. Applied to the Urmi dialect (C. Urmi), NoLoR yields a character error rate (CER) of about 12.5% and can accelerate transcription by up to 6.3x, while relying on relatively small, carefully prepared datasets. The contributions include a publicly available 35-minute dataset, an ASR model, and the AssyrianVoices crowdsourcing platform, demonstrating a practical pathway to scaling endangered-language documentation and increasing community involvement.
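For readers unfamiliar with the metric, CER is conventionally computed as the character-level edit distance between a reference transcript and the model's output, divided by the reference length. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function names and the placeholder strings are assumptions made for illustration only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Placeholder strings (hypothetical, not real transcripts):
# one wrong character out of eight gives a CER of 0.125, i.e. 12.5%.
example = cer("abcdefgh", "abcdefgx")
```

Under this definition, a CER of about 12.5% corresponds to roughly one character edit for every eight reference characters.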
Abstract
The documentation of the Neo-Aramaic dialects before their extinction has been described as the most urgent task in all of Semitology today. The death of this language will be an unfathomable loss to the descendants of the indigenous speakers of Aramaic, now predominantly diasporic after forced displacement due to violence. This paper develops an ASR model to expedite the documentation of this endangered language and generalizes the strategy in a new framework we call NoLoR.
