Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't
Chihiro Taguchi, David Chiang
TL;DR
This work examines how properties of writing systems influence ASR accuracy by fine-tuning a multilingual self-supervised model (Wav2Vec2-XLSR-53) on 25 languages with diverse scripts. It introduces two metrics, logographicity (an attention-based measure) and Calibrated Errors Per Second (CEPS), to compare performance across languages without the bias introduced by language-specific token definitions. The results show a robust link between orthographic complexity (grapheme inventory size, unigram entropy, and logographicity) and higher character error rate (CER), while phoneme inventory size shows no significant effect; CEPS partially mitigates the orthographic effects. The findings highlight the challenge that orthography poses for multilingual ASR and suggest avenues for language inclusion and robust training in languages with complex scripts. The advantage observed for English likely stems from targeted pretraining data and clearer corpora.
Abstract
We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. We hypothesize that orthographic and phonological complexities both degrade accuracy. To examine this, we fine-tune the multilingual self-supervised pretrained model Wav2Vec2-XLSR-53 on 25 languages with 15 writing systems, and we compare their ASR accuracy, number of graphemes, unigram grapheme entropy, logographicity (how much word/morpheme-level information is encoded in the writing system), and number of phonemes. The results demonstrate that orthographic complexities significantly correlate with low ASR accuracy, while phonological complexity shows no significant correlation.
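Two of the orthographic measures named above, the number of graphemes and the unigram grapheme entropy, can be illustrated with a minimal sketch. It uses the standard Shannon entropy, H = -Σ p(g) log2 p(g), over grapheme frequencies in a text sample. The function name `grapheme_stats` is hypothetical, and treating each non-whitespace Unicode code point as one grapheme is a simplifying assumption; the paper's own segmentation and preprocessing may differ.

```python
import math
from collections import Counter


def grapheme_stats(text: str) -> tuple[int, float]:
    """Return (grapheme inventory size, unigram entropy in bits).

    Assumption: every non-whitespace Unicode code point counts as one
    grapheme. Scripts with combining marks may need a proper grapheme
    cluster segmenter instead.
    """
    symbols = [ch for ch in text if not ch.isspace()]
    if not symbols:
        return 0, 0.0

    counts = Counter(symbols)
    total = len(symbols)

    # Shannon entropy of the unigram grapheme distribution, in bits.
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return len(counts), entropy


if __name__ == "__main__":
    # Toy strings for illustration only, not data from the study.
    for sample in ["abab", "abcd"]:
        size, h = grapheme_stats(sample)
        print(f"{sample!r}: {size} graphemes, {h:.2f} bits")
```

On a transcript corpus, a larger inventory and a flatter frequency distribution both raise the entropy, which is why logographic scripts tend to score higher on this measure than alphabetic ones.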
