AutoMode-ASR: Learning to Select ASR Systems for Better Quality and Cost

Ahmet Gündüz, Yunsu Kim, Kamer Ali Yuksel, Mohamed Al-Badrashiny, Thiago Castro Ferreira, Hassan Sawaf

TL;DR

AutoMode-ASR addresses the challenge of selecting the most suitable ASR system per audio segment without running all candidate models. It frames system selection as learning to rank via one-vs-pivot binary classifiers that leverage rich audio, embedding, confidence, and quality-estimation features, powered by XGBoost. The approach enables incremental system integration and maintains compatibility with both commercial and open-source black-box ASR systems. Empirical results on Common Voice and LibriSpeech across multiple languages show notable WER reductions (up to 16.2% relative) with substantial cost savings (up to 65%) and speed gains (75%), demonstrating practical benefits for scalable ASR deployment.
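
The one-vs-pivot idea can be illustrated with a minimal sketch. It assumes that each non-pivot system has a binary classifier returning the probability that it beats the pivot on a given segment, and that the system with the highest such probability is chosen when it exceeds a threshold. The aggregation rule, the threshold, the feature layout, and all names (`select_system`, `system_a`, and so on) are hypothetical and not taken from the paper; the toy lambdas stand in for trained XGBoost models.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type: a classifier maps a segment's feature vector to the
# probability that its candidate system beats the pivot on that segment.
Classifier = Callable[[Sequence[float]], float]


def select_system(
    features: Sequence[float],
    pivot: str,
    classifiers: Dict[str, Classifier],
    threshold: float = 0.5,
) -> str:
    """Pick one ASR system for a segment before running any of them.

    Assumed rule (not from the paper): return the candidate whose
    "beats the pivot" probability is highest, provided it exceeds
    `threshold`; otherwise fall back to the pivot system.
    """
    best_name, best_prob = pivot, threshold
    for name, clf in classifiers.items():
        prob = clf(features)
        if prob > best_prob:
            best_name, best_prob = name, prob
    return best_name


# Toy stand-ins for trained classifiers; features = [snr_db, duration_s].
toy_classifiers: Dict[str, Classifier] = {
    "system_b": lambda f: 0.8 if f[0] < 10 else 0.2,  # prefers noisy audio
    "system_c": lambda f: 0.6 if f[1] > 5 else 0.4,   # prefers long segments
}

noisy_short = select_system([5.0, 3.0], "system_a", toy_classifiers)
clean_long = select_system([20.0, 8.0], "system_a", toy_classifiers)
clean_short = select_system([20.0, 3.0], "system_a", toy_classifiers)
```

Under this rule, adding a new system only requires training one more classifier against the pivot, which fits the incremental integration described above.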

Abstract

We present AutoMode-ASR, a novel framework that effectively integrates multiple ASR systems to enhance the overall transcription quality while optimizing cost. The idea is to train a decision model to select the optimal ASR system for each segment based solely on the audio input before running the systems. We achieve this by ensembling binary classifiers, each of which determines the preference between two systems. These classifiers are equipped with various features, such as audio embeddings, quality estimation, and signal properties. Additionally, we demonstrate how using a quality estimator can further improve performance with minimal cost increase. Experimental results show a relative reduction in WER of 16.2%, a cost saving of 65%, and a speed improvement of 75%, compared to using a single-best model for all segments. Our framework is compatible with commercial and open-source black-box ASR systems as it does not require changes to model code.
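
To make the "relative reduction" wording concrete, here is a small sketch of the arithmetic. The baseline and improved values are illustrative placeholders chosen to produce a 16.2% relative reduction; they are not results reported in the paper, and `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`.

    A relative reduction of 0.162 means the new value is 16.2% lower
    than the baseline, regardless of the baseline's absolute size.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Illustrative numbers only (assumed, not from the paper):
# a single-best system at 10.00% WER versus a selector at 8.38% WER.
wer_gain = relative_reduction(10.00, 8.38)

# The same formula applies to cost: 35 cost units instead of 100.
cost_saving = relative_reduction(100.0, 35.0)
```

Note that a relative WER reduction differs from an absolute one: in the illustrative case the absolute drop is 1.62 percentage points, while the relative drop is 16.2%.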

Paper Structure (10 sections, 2 figures, 4 tables)

This paper contains 10 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The Diagram of the AutoMode-ASR Workflow
  • Figure 2: Mean feature importance of binary classifiers.