Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe
TL;DR
The paper tackles rapid language adaptation for multilingual end-to-end ASR by embedding encoder prompts into a self-conditioned CTC framework, enabling zero-shot, language-specific adaptation of the CTC model when the input language is known in advance. It introduces three prompt schemes (Replacement, Aggregation, and Prefix) and also explores soft prompting, all of which bias encoder outputs toward a target language at inference time without retraining. In experiments on Common Voice+VoxForge and FLEURS, the authors report substantial reductions in error rates: a 28% relative reduction on average and up to 41% for extremely low-resource languages, with high language-ID fidelity in many cases. The approach offers a practical, inference-time mechanism for tailoring a single multilingual model to a specific language, which benefits deployments on personal devices and in multilingual services by reducing the need for language-specific retraining.
Abstract
End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Because the language is already known in many common scenarios, these models can act as language-specific models by taking language information as a prompt, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts because its output tokens are conditionally independent. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has been shown to reduce errors significantly, by 28% on average and by 41% on low-resource languages.
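The idea of prompting the encoder through self-conditioning can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a single self-conditioning point where the intermediate CTC posterior is projected back and added to the hidden states, and it assumes a Replacement-style prompt simply overwrites that posterior with a one-hot vector for the known language token. The names `self_conditioned_step`, `w_out`, `w_back`, and `prompt_token`, and all sizes, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_conditioned_step(hidden, w_out, w_back, prompt_token=None):
    """One self-conditioning point inside an encoder (illustrative only).

    hidden: (T, D) intermediate encoder states
    w_out:  (D, V) projection to the intermediate CTC vocabulary
    w_back: (V, D) projection of the posterior back to the hidden size
    prompt_token: optional index of the known language token (assumption)
    """
    posterior = softmax(hidden @ w_out)  # (T, V) intermediate CTC posterior
    if prompt_token is not None:
        # Assumed Replacement-style prompt: ignore the model's own
        # prediction and feed back a one-hot posterior for the given token.
        posterior = np.zeros_like(posterior)
        posterior[:, prompt_token] = 1.0
    # Self-conditioning: later layers see the (possibly prompted) posterior.
    return hidden + posterior @ w_back, posterior


rng = np.random.default_rng(0)
T, D, V = 5, 8, 6  # frames, hidden size, vocabulary size (toy values)
hidden = rng.normal(size=(T, D))
w_out = rng.normal(size=(D, V))
w_back = rng.normal(size=(V, D))

free, post_free = self_conditioned_step(hidden, w_out, w_back)
prompted, post_prompted = self_conditioned_step(
    hidden, w_out, w_back, prompt_token=2
)
```

In this sketch no weights change between the two calls; only the signal fed back into the encoder differs, which is why the adaptation needs no retraining. With the prompt, every frame receives the same conditioning vector `w_back[2]`.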
