Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchitl Mixtec ASR

Chris Crawford

TL;DR

This work tackles the challenge of non-concatenative, tonal morphology in Yoloxóchitl Mixtec by developing morphologically-informed tokenizers that separate segments from tonal melodies (Segment-Melody) or encode tonal processes as a per-mora sequence (Sequence-of-Processes). Using wav2gloss-inspired data and FST-based pipelines, the study evaluates these tokenizers against baseline BPE/Unigram/WordPiece configurations on end-to-end ASR tasks, reporting that Segment-Melody often yields the best WER while traditional CER may favor BPE. Intrinsic metrics reveal Morphological F1 as a meaningful predictor of downstream ASR performance, supporting the claim that morphology-aligned tokenization improves processing of non-concatenative morphology. The results suggest nonlinear, morphologically tailored tokenizers are competitive with conventional subword models and highlight the need for gold-standard segmentation data and further optimization for practical wav2gloss pipelines.

Abstract

This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of a human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible. One of these approaches, a Segment and Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence of Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models, and the Segment-and-Melody model outperforms traditional tokenizers in terms of word error rate but does not reach the same character error rate. In addition, we analyze tokenizers on morphological and information-theoretic metrics to find predictive correlations with downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.
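
As a rough illustration of the Segment and Melody idea described above (extracting the tones without predicting segmentation), the following minimal sketch splits a word into its segmental string and its tonal melody. It assumes, purely for illustration, that tones are written as the digits 1 to 4 interleaved with the segments; the function name `split_segments_and_melody`, the `tone_chars` parameter, and the example form `ta1bi4` are hypothetical and are not taken from the paper or its data.

```python
def split_segments_and_melody(word, tone_chars="1234"):
    """Separate a word into (segments, melody).

    Assumption: tones are marked by the characters in `tone_chars`,
    written inline with the segments. This is an illustrative
    convention, not necessarily the paper's representation.
    """
    # Segments: every character that is not a tone mark, in order.
    segments = "".join(c for c in word if c not in tone_chars)
    # Melody: every tone mark, in order of appearance.
    melody = "".join(c for c in word if c in tone_chars)
    return segments, melody


# Hypothetical input form, used only to show the split.
segments, melody = split_segments_and_melody("ta1bi4")
print(segments, melody)
```

A tokenizer built on this split could then model the segment string and the melody as separate token streams; how the paper actually encodes and recombines them is not specified in this summary.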

Paper Structure

This paper contains 30 sections, 8 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Examples of the G3 format in various languages, showcasing different non-concatenative processes it can represent [mortensen2023generalized].
  • Figure 2: Comparison of BPE tokenization and formal segmentation for Finnish (left) and Yoloxóchitl Mixtec (right). Each BPE token is represented by a colored box, and segmentation is presented in G3. The Finnish example is glossed in Sentence (\ref{sent:gloss-example-deep}), and the YM example is glossed in Sentence (\ref{sent:tabi-tokenizer}).
  • Figure 3: A simple WFST that accepts all strings of the form baa*. For the input string "baa", it outputs "moo" with weight 3 and "noo" with weight 2.
  • Figure 4: An illustration of the unweighted machine $R_{\mathbf{\phi}\times\mathbf{\psi}}$ from the Mohri-Sproat construction. Here, $\Sigma:\Sigma$ denotes any character $c\in\Sigma$ being transduced to itself, and $\phi_1\dots\phi_n$ and $\psi_1\dots\psi_n$ are the respective expansions of $\mathbf{\phi}$ and $\mathbf{\psi}$, normalized to have the same length for simplicity.
  • Figure 5: Comparison of ASR performance metrics WER (left) and CER (right) and intrinsic tokenizer metrics Morph.-F1 (top) and Sparsity (bottom). Each point on the plot represents a tokenizer, using the average WER and CER from Table \ref{tab:asr-results}. Note that outliers may not appear in the graph due to cropping.
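
The behaviour described in the Figure 3 caption can be sketched in a few lines of plain Python. The figure itself is not reproduced here, so the arc layout below is a hypothetical one chosen only to match the caption: it accepts strings of the form baa* and, for the input "baa", yields "moo" with weight 3 and "noo" with weight 2. The sketch also assumes that weights accumulate by addition along a path and that the lowest weight is kept when two paths produce the same output (a tropical-semiring convention); the caption does not state which semiring is used.

```python
# Hypothetical arcs: (source, input, output, weight, destination).
# Chosen only so that the machine matches the Figure 3 caption.
ARCS = [
    (0, "b", "m", 1, 1),
    (1, "a", "o", 1, 2),
    (2, "a", "o", 1, 2),  # self-loop: any number of further a's
    (0, "b", "n", 0, 3),
    (3, "a", "o", 1, 4),
    (4, "a", "o", 1, 4),  # self-loop: any number of further a's
]
START = 0
FINALS = {2, 4}


def transduce(text):
    """Return {output string: lowest path weight} over accepting paths."""
    # Each hypothesis is (current state, output so far, weight so far).
    hyps = [(START, "", 0)]
    for ch in text:
        nxt = []
        for state, out, weight in hyps:
            for src, sym_in, sym_out, arc_weight, dst in ARCS:
                if src == state and sym_in == ch:
                    # Assumption: weights add along a path.
                    nxt.append((dst, out + sym_out, weight + arc_weight))
        hyps = nxt
    results = {}
    for state, out, weight in hyps:
        if state in FINALS:
            # Assumption: keep the lowest weight per output string.
            results[out] = min(weight, results.get(out, weight))
    return results


print(transduce("baa"))
print(transduce("b"))  # not of the form baa*, so no accepting path
```

The exhaustive hypothesis list is acceptable for a toy machine like this one; a practical pipeline would normally rely on an FST toolkit rather than hand-rolled search.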