Linguistically Informed Evaluation of Multilingual ASR for African Languages

Fei-Yueh Chen, Lateef Adeleke, C. M. Downey

TL;DR

WER inadequately captures ASR performance for tonal African languages; this paper introduces FER and TER as linguistically informed metrics and demonstrates their value across Yoruba and Uneme. Using mHuBERT-based encoders and Transformer decoders, the study shows substantial phonological learning even when word-level accuracy is poor, with FER/TER revealing errors in tone and vowel features that WER misses. The Uneme baseline provides new data for an endangered language, and the results highlight domain-shift effects, particularly with careful speech, and the superiority of Transformer decoders over linear mappings. The work argues for pitch-aware modeling and collaboration with field linguists to create datasets and evaluation frameworks that respect African language typology, with practical implications for improving low-resource ASR systems.

Abstract

Word Error Rate (WER) mischaracterizes ASR models' performance for African languages by combining phonological, tone, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in models' performance. In this paper, we evaluate three speech encoders on two African languages by complementing WER with Character Error Rate (CER) and FER, and we add a tone-aware extension, Tone Error Rate (TER). We show that by computing errors on phonological features, FER and TER reveal linguistically salient error patterns even when word-level accuracy remains low. Our results reveal that models perform better on segmental features, while tones (especially mid and downstep) remain the most challenging features. Results on Yoruba show a striking differential in metrics, with WER=0.788, CER=0.305, and FER=0.151. Similarly, for Uneme (an endangered language absent from pretraining data), a model with near-total WER and 0.461 CER achieves the relatively low FER of 0.267. This indicates that model error is often attributable to individual phonetic feature errors, a pattern that all-or-nothing metrics like WER obscure.
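To make the gap between the three metrics concrete, the sketch below scores a single tone error at the word, segment, and feature level. It is an illustration only: the abstract does not give the paper's exact FER formulation, so the sketch assumes a feature-weighted edit distance in which a substitution costs the fraction of features that differ. The feature table, the `a_H` token for a high-toned vowel, and all function names are hypothetical.

```python
from typing import Callable, Hashable, Sequence

# Hypothetical toy feature table (NOT taken from the paper).
# Each segment maps to binary features: (voiced, nasal, labial, high_tone).
FEATURES = {
    "b": (1, 0, 1, 0),
    "m": (1, 1, 1, 0),
    "a": (1, 0, 0, 0),
    "a_H": (1, 0, 0, 1),  # same vowel carrying a high tone
}


def unit_sub_cost(a: Hashable, b: Hashable) -> float:
    """All-or-nothing substitution cost, as used by WER and CER."""
    return 0.0 if a == b else 1.0


def feature_sub_cost(a: str, b: str) -> float:
    """Assumed FER-style cost: fraction of features that differ."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def edit_distance(ref: Sequence, hyp: Sequence,
                  sub_cost: Callable[..., float]) -> float:
    """Levenshtein distance with unit insert/delete cost and a
    pluggable substitution cost."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1.0,        # deletion
                            curr[j - 1] + 1.0,    # insertion
                            prev[j - 1] + sub_cost(r, h)))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence,
               sub_cost: Callable[..., float] = unit_sub_cost) -> float:
    """Edit distance normalized by reference length."""
    return edit_distance(ref, hyp, sub_cost) / len(ref)


# Reference and hypothesis differ only in the tone of the first vowel.
ref_words = [("b", "a_H"), ("m", "a")]
hyp_words = [("b", "a"), ("m", "a")]
ref_segs = [seg for word in ref_words for seg in word]
hyp_segs = [seg for word in hyp_words for seg in word]

wer = error_rate(ref_words, hyp_words)                   # whole word counted wrong
cer = error_rate(ref_segs, hyp_segs)                     # whole segment counted wrong
fer = error_rate(ref_segs, hyp_segs, feature_sub_cost)   # only one feature wrong

print(f"WER={wer}, CER={cer}, FER={fer}")
```

Under these assumptions, the single tone error yields WER 0.5, CER 0.25, and FER 0.0625, mirroring the ordering WER > CER > FER reported in the abstract. A tone-only metric in the spirit of TER could be sketched the same way by restricting the comparison to the tone feature.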

Paper Structure (32 sections, 1 equation, 11 tables)