Efficient Adaptation of Multilingual Models for Japanese ASR
Mark Bajo, Haruka Fukukawa, Ryuji Morita, Yuma Ogasawara
TL;DR
This paper improves Japanese ASR in multilingual Whisper models by fine-tuning on Japanese data, using both LoRA adaptation and end-to-end training. It draws on four Japanese datasets and applies SpecAugment to improve robustness. For Whisper-Tiny, CER falls from 32.7 to 20.8 with LoRA and to 14.7 with end-to-end fine-tuning. End-to-end fine-tuning brings Whisper-Base to about 10.07 CER and Whisper-Small to about 7.38 CER, while the ReazonSpeech baselines reach as low as 4.62 CER. The findings indicate that specializing a multilingual model for one language can deliver strong per-language performance while retaining cross-language flexibility, offering a scalable, resource-efficient way to improve ASR for languages with complex scripts. Domain-specific terminology remains a challenge, however, and targeted datasets are needed to close the remaining gap.
Abstract
This study explores fine-tuning multilingual automatic speech recognition (ASR) models, specifically OpenAI's Whisper-Tiny, to improve performance in Japanese. Multilingual models like Whisper offer versatility but often lack precision in individual languages. Conversely, monolingual models like ReazonSpeech excel in language-specific tasks but are less adaptable. To bridge this gap, we fine-tuned Whisper-Tiny on Japanese-specific datasets using Low-Rank Adaptation (LoRA) and end-to-end (E2E) training. Fine-tuning reduced Whisper-Tiny's Character Error Rate (CER) from 32.7 to 20.8 with LoRA and to 14.7 with end-to-end fine-tuning; the end-to-end result is lower than Whisper-Base's CER of 20.2. However, domain-specific terms remain difficult to recognize, highlighting the need for specialized datasets. These findings demonstrate that fine-tuning multilingual models can achieve strong language-specific performance while retaining their flexibility. This approach provides a scalable way to improve ASR in resource-constrained environments and for languages with complex writing systems like Japanese.
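Since every result above is reported as CER, a minimal sketch of how such a score can be computed may help in reading the numbers. It uses the standard definition (character-level edit distance divided by the number of reference characters) and is not the evaluation code used in this study. The function names, the corpus-level aggregation, the removal of spaces, and the scaling to a 0-100 range are all assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference character missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra character in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER on a 0-100 scale (assumed scale and aggregation)."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: spaces are ignored, since Japanese text is not space-delimited.
        ref = ref.replace(" ", "")
        hyp = hyp.replace(" ", "")
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 100.0 * total_edits / total_chars


if __name__ == "__main__":
    refs = ["今日は良い天気です"]
    hyps = ["今日は良い元気です"]  # one substituted character out of nine
    print(f"CER: {cer(refs, hyps):.2f}")
```

Character-level scoring suits Japanese because the script has no whitespace word boundaries, so a word error rate would depend on the choice of tokenizer. Text normalization (punctuation, full-width versus half-width characters, kanji versus kana spellings) can shift CER noticeably and is left out of this sketch.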
