Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchitl Mixtec ASR
Chris Crawford
TL;DR
This work tackles the challenge of non-concatenative, tonal morphology in Yoloxóchitl Mixtec by developing morphologically-informed tokenizers that either separate segments from tonal melodies (Segment-Melody) or encode tonal processes as a per-mora sequence (Sequence-of-Processes). Using wav2gloss-inspired data and FST-based pipelines, the study evaluates these tokenizers against baseline BPE, Unigram, and WordPiece configurations on end-to-end ASR tasks. It reports that Segment-Melody often yields the best WER, while CER may favor BPE. Among the intrinsic metrics, Morphological F1 emerges as a meaningful predictor of downstream ASR performance, supporting the claim that morphology-aligned tokenization improves the processing of non-concatenative morphology. The results suggest that nonlinear, morphologically tailored tokenizers are competitive with conventional subword models, and they highlight the need for gold-standard segmentation data and further optimization for practical wav2gloss pipelines.
Abstract
This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM). The annotation pipeline combines ASR with text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of a human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving as much information about tonal morphology as possible. One of these approaches, a Segment-and-Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence-of-Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models, and that the Segment-and-Melody model outperforms traditional tokenizers in terms of word error rate but does not reach the same character error rate. In addition, we analyze the tokenizers on morphological and information-theoretic metrics to find predictive correlations with downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.
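
To make the Segment-and-Melody idea concrete, the following is a minimal sketch of splitting a word into its segmental string and its tonal melody. It is not the paper's implementation. It assumes, for illustration only, that tones are written as digits 1-4 immediately after each mora and that a run of digits forms one contour tone. The function name `segment_melody_split` and the input string are hypothetical, and the example word is a made-up form rather than an attested YM word.

```python
import re

# Assumption: tone is marked with digits 1-4 after each mora; a run of
# digits (e.g. "14") is treated as a single contour tone.
TONE_RUN = re.compile(r"[1-4]+")


def segment_melody_split(word: str) -> tuple[str, list[str]]:
    """Split a tone-marked word into (segments, melody).

    segments: the word with all tone digits removed.
    melody:   the tone marks in order, one entry per tone-bearing unit.
    """
    melody = TONE_RUN.findall(word)
    segments = TONE_RUN.sub("", word)
    return segments, melody


# Hypothetical input, used only to illustrate the split.
segments, melody = segment_melody_split("ku3nu14")
print(segments, melody)
```

Keeping the melody as its own sequence means that two words with identical segments but different tones share a segment string and differ only in the melody, which is the property a tokenizer for tonal morphology would want to expose.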
