Where Are We At with Automatic Speech Recognition for the Bambara Language?

Seydou Diallo, Yacouba Diarra, Mamadou K. Keita, Panga Azazia Kamaté, Adam Bouno Kampo, Aboubacar Ouattara

TL;DR

This work presents the first standardized benchmark and public leaderboard for Bambara ASR, built on a one-hour studio recording of the Malian constitution that enables controlled, comparable evaluation. Across 37 models, including monolingual, multilingual, and proprietary systems, the best WER is 46.76% and the best CER is 13.00% (set by a different model), while several prominent multilingual models fail catastrophically (WER > 100%), highlighting the limits of transfer learning for underrepresented languages. The results indicate that data scarcity, domain mismatch, orthographic variation, and morphological complexity remain the primary obstacles, and that neither model scale nor multilingual pre-training alone is sufficient to reach production readiness in this language. The authors advocate standardized benchmarking, diverse and naturalistic data collection, and targeted architecture research, and they release open data and a public leaderboard to drive progress toward practical Bambara ASR.

Abstract

This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, using one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards even in this narrow formal domain: the best Word Error Rate (WER) was 46.76%, the best Character Error Rate (CER) was 13.00% (achieved by a different model), and several prominent multilingual models exceeded 100% WER. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario, the most simplified and formal form of spoken Bambara, these figures have yet to be tested against practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
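A WER above 100% can look paradoxical, so a minimal sketch of the standard definitions may help: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, and CER is the same ratio over characters. Because insertions are counted but the denominator is fixed by the reference, a hypothesis that is longer than the reference and mostly wrong can push the ratio past 1. The function names, the whitespace tokenization, and the placeholder strings below are illustrative assumptions; the excerpt above does not specify the text normalization or scoring tool used for the benchmark.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum number of substitutions,
    deletions, and insertions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, assuming plain whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over raw characters (spaces included)."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Placeholder tokens, not real Bambara text.
    # 3 substitutions + 2 insertions against 3 reference words gives 5/3.
    print(f"WER: {wer('a b c', 'x y z w v'):.3f}")
    # One wrong character out of four.
    print(f"CER: {cer('abcd', 'abxd'):.3f}")
```

The first call shows how a system that emits more words than the reference contains, with none of them correct, scores above 100% WER, which is the failure pattern the abstract reports for several multilingual models.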

Paper Structure (33 sections, 1 figure, 5 tables)