Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty

Hongfei Xue, Yufeng Tang, Jun Zhang, Xuelong Geng, Lei Xie

TL;DR

This work tackles cross-language variability and data imbalance in multilingual ASR by introducing SIMA, a selective invocation framework built on a base spoken large language model (SLLM). For each input, SIMA decides whether to transcribe directly or to invoke a specialized SOTA model, using an 'Invocation Uncertain' state and a fusion confidence strategy with thresholds $P$, $E$, and $T$. A data pipeline generates training data by labeling samples with WER-based invocation categories and language confidence, enabling dynamic, cost-aware routing. Experiments on MLS, VoxPopuli, and FLEURS show that SIMA achieves up to 18.7% relative WER reduction and about 51% cost savings versus LID-based approaches, indicating that it scales to real-world multilingual ASR deployments.

Abstract

Although multilingual automatic speech recognition (ASR) systems have advanced significantly, enabling a single model to handle multiple languages, inherent linguistic differences and data imbalances make it difficult to reach SOTA performance across all languages. While language identification (LID) models can route speech to the appropriate ASR model, they incur high costs from invoking SOTA commercial models and suffer from inaccuracies due to misclassification. To overcome these limitations, we propose SIMA, a selective invocation approach for multilingual ASR that adapts to the difficulty level of the input speech. Built on a spoken large language model (SLLM), SIMA evaluates whether the input is simple enough for direct transcription or requires the invocation of a SOTA ASR model. Our approach reduces word error rates by 18.7% compared to the SLLM and halves invocation costs compared to LID-based methods. Tests on three datasets show that SIMA is a scalable, cost-effective solution for multilingual ASR applications.
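
The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `BaseOutput` fields, the `route` function, and the single `threshold` parameter are hypothetical stand-ins, and the sketch does not model the 'Invocation Uncertain' state or the $P$, $E$, $T$ thresholds mentioned in the TL;DR. It only assumes that the base model returns a transcript, a language guess, and some score indicating how likely it is that a stronger model is needed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class BaseOutput:
    """Hypothetical output of the base SLLM for one utterance."""
    transcript: str           # direct transcription from the base model
    language: str             # language guess, used to pick a SOTA model
    invoke_confidence: float  # assumed score in [0, 1]: higher = harder input


def route(
    audio: str,
    base_model: Callable[[str], BaseOutput],
    sota_models: Dict[str, Callable[[str], str]],
    threshold: float = 0.5,  # illustrative value, not taken from the paper
) -> Tuple[str, bool]:
    """Return (transcript, invoked), where invoked marks a paid SOTA call."""
    out = base_model(audio)

    # Easy input: keep the base model's transcript and pay nothing extra.
    if out.invoke_confidence < threshold:
        return out.transcript, False

    # Hard input: hand off to the SOTA model for the predicted language,
    # falling back to the base transcript if no such model is registered.
    sota = sota_models.get(out.language)
    if sota is None:
        return out.transcript, False
    return sota(audio), True
```

Counting how often `invoked` is true over a test set gives the invocation rate, which is the quantity the cost comparison against LID-based routing depends on: an LID-based system calls a SOTA model for every utterance, whereas a selective system pays only for the inputs judged difficult.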

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Three systems for multilingual ASR. (a) A single multilingual model, such as Whisper, which recognizes multiple languages with one model. (b) A language identification (LID)-based system that identifies the language and invokes the corresponding SOTA model. (c) Selective invocation for multilingual ASR (SIMA) that directly transcribes simpler speech and invokes SOTA models for more complex inputs.
  • Figure 2: The multitask training format of the SIMA model.
  • Figure 3: Data pipeline of the SIMA dataset.