AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR
Gabrial Zencha Ashungafac, Mardhiyah Sanni, Busayo Awobade, Alex Gichamba, Tobi Olatunji
TL;DR
AfriSpeech-MultiBench addresses a critical gap in ASR evaluation by creating a domain-specific, zero-shot benchmark for African English accents across seven application domains and multiple countries. It harmonizes seven African corpora and evaluates 19 diverse models, revealing a substantial gap between standard benchmarks and real-world African usage, with regionally tuned models like Intron-Sahara V2 delivering superior accuracy and robustness. The work provides fine-grained error analyses by accent, domain, and model class, highlighting persistent challenges in named entities, conversational speech, and robustness, and it demonstrates the practical value of regionalized training for deployment. By publicly releasing the benchmark and conducting comprehensive cross-domain assessments, the paper offers actionable guidance for building inclusive, Africa-ready ASR systems and motivates further data collection and model adaptation in low-resource settings.
Abstract
Recent advances in speech-enabled AI, including Google's NotebookLM and OpenAI's speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa's linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech drawn from various open African-accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment. However, hallucinations remain a major problem for most state-of-the-art models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use cases, fostering inclusive voice applications for underserved communities.
