Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't

Chihiro Taguchi, David Chiang

TL;DR

This work examines how properties of a language's writing system influence ASR accuracy by fine-tuning a multilingual self-supervised model (Wav2Vec2-XLSR-53) on 25 languages with diverse scripts. It introduces two metrics: logographicity, an attention-based measure of how much word- or morpheme-level information the writing system encodes, and Calibrated Errors Per Second (CEPS), which is meant to allow cross-language comparison without bias from differences in how tokens are defined in each language. The results show a robust link between orthographic complexity (grapheme inventory size, unigram grapheme entropy, and logographicity) and higher character error rate (CER), while phoneme inventory size shows no significant effect; CEPS partially mitigates the orthographic effects. The findings highlight the challenge that orthography poses for multilingual ASR and suggest avenues for language inclusion and more robust training for languages with complex scripts. They also note that the advantage observed for English likely stems from targeted pretraining data and clearer corpora.

Abstract

We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. We hypothesize that orthographic and phonological complexities both degrade accuracy. To examine this, we fine-tune the multilingual self-supervised pretrained model Wav2Vec2-XLSR-53 on 25 languages with 15 writing systems, and we compare their ASR accuracy, number of graphemes, unigram grapheme entropy, logographicity (how much word/morpheme-level information is encoded in the writing system), and number of phonemes. The results demonstrate that orthographic complexities significantly correlate with low ASR accuracy, while phonological complexity shows no significant correlation.
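
Two of the quantities named above, unigram grapheme entropy and character error rate, can be illustrated with a minimal sketch. This is not the authors' code. It assumes that each Unicode code point counts as one grapheme, that whitespace is dropped before computing entropy, and that CER is the Levenshtein distance divided by the reference length. The function names are hypothetical.

```python
import math
from collections import Counter


def unigram_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the grapheme unigram distribution.

    Assumption: one Unicode code point = one grapheme; whitespace is ignored.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under these assumptions, a script with a large grapheme inventory tends to spread probability mass over many symbols and so yields a higher unigram entropy than a small alphabet. A single wrong character also costs the same in CER regardless of how much information that character carries, which is part of why raw CER is hard to compare across writing systems.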

Paper Structure (26 sections, 7 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Visualization of the self-supervised pretraining step of Wav2Vec 2.0.
  • Figure 2: A visualization of attention masking. The top matrix shows the original distributions of attention scores with a Japanese phoneme input and an orthographic output of the target word. The bottom matrix has zeroed-out attention values for the cells corresponding to the target word. The logographicity score $S_\text{token}$ measures how much information is retained after masking. Values near 1 are in yellow and those near 0 in dark purple.
  • Figure 3: Calibrated errors per second (solid) compared with raw errors per second (dashed), assuming $\tau=1$.
  • Figure 4: CER versus various measures of linguistic complexity.
  • Figure 5: Comparison of validation CERs during training with different writing systems for Japanese (top), Korean (middle), and Chinese (bottom).