Efficient Adaptation of Multilingual Models for Japanese ASR

Mark Bajo, Haruka Fukukawa, Ryuji Morita, Yuma Ogasawara

TL;DR

This paper improves Japanese ASR in the multilingual Whisper models by fine-tuning on Japanese data, using both LoRA adaptation and end-to-end training. It draws on four Japanese datasets and applies SpecAugment for robustness. For Whisper-Tiny, CER falls from 32.7 to 20.8 with LoRA and to 14.7 with end-to-end fine-tuning. End-to-end fine-tuning brings Whisper-Base to about 10.07 CER and Whisper-Small to about 7.38 CER, while the monolingual ReazonSpeech baselines reach as low as 4.62 CER. The findings indicate that language-specific specialization of a multilingual model can deliver strong per-language performance while retaining cross-language flexibility, which makes it a resource-efficient route to better ASR in languages with complex scripts. Domain-specific terminology remains a challenge, so targeted datasets are still needed to close the remaining gap.
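The LoRA adaptation mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' training code: the layer sizes, the rank r = 8, the scaling alpha = 16, and the names lora_forward and trainable_params are assumptions chosen for illustration. The pretrained weight W stays frozen, and only the low-rank factors A and B would be updated during fine-tuning.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Linear layer with a low-rank update: y = x W^T + (alpha / r) * x A^T B^T.

    W has shape (d_out, d_in) and is frozen.
    A has shape (r, d_in) and B has shape (d_out, r); both are trainable.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


def trainable_params(d_in, d_out, r):
    """Number of parameters in the two low-rank factors A and B."""
    return r * (d_in + d_out)


# Illustrative sizes only (assumed, not taken from the paper).
d_in, d_out, r, alpha = 384, 384, 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # small random initialization
B = np.zeros((d_out, r))                    # zero init: the adapter starts as a no-op
x = rng.standard_normal((2, d_in))          # a batch of two input vectors

y = lora_forward(x, W, A, B, alpha)
print(y.shape)
print(trainable_params(d_in, d_out, r), "adapter parameters vs", W.size, "in the full matrix")
```

With these illustrative sizes the adapter holds 6,144 trainable parameters against 147,456 in the full matrix. Savings of this kind are what make LoRA attractive when compute is limited, while end-to-end fine-tuning updates every weight.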

Abstract

This study explores fine-tuning multilingual ASR (Automatic Speech Recognition) models, specifically OpenAI's Whisper-Tiny, to improve performance in Japanese. While multilingual models like Whisper offer versatility, they often lack precision in specific languages. Conversely, monolingual models like ReazonSpeech excel in language-specific tasks but are less adaptable. Using Japanese-specific datasets and Low-Rank Adaptation (LoRA) along with end-to-end (E2E) training, we fine-tuned Whisper-Tiny to bridge this gap. Our results show that fine-tuning reduced Whisper-Tiny's Character Error Rate (CER) from 32.7 to 20.8 with LoRA and to 14.7 with end-to-end fine-tuning, surpassing Whisper-Base's CER of 20.2. However, challenges with domain-specific terms remain, highlighting the need for specialized datasets. These findings demonstrate that fine-tuning multilingual models can achieve strong language-specific performance while retaining their flexibility. This approach provides a scalable solution for improving ASR in resource-constrained environments and languages with complex writing systems like Japanese.
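Since every result above is reported as CER, a small sketch of how the metric is commonly computed may help. It assumes CER is the character-level Levenshtein distance divided by the reference length, expressed as a percentage. Any text normalization the authors apply before scoring (punctuation, whitespace, script variants) is not described here and is left out. The function names and the example sentence are hypothetical.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r_ch in enumerate(ref, start=1):
        curr = [i]
        for j, h_ch in enumerate(hyp, start=1):
            cost = 0 if r_ch == h_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Relative improvement in percent when an error rate drops from before to after."""
    return 100.0 * (before - after) / before


# Hypothetical example: one kanji is wrong although the reading is identical.
reference = "今日は晴れです"
hypothesis = "今日は腫れです"
print(round(cer(reference, hypothesis), 2))

# Relative CER reductions implied by the Whisper-Tiny figures quoted above.
print(round(relative_reduction(32.7, 20.8)))  # LoRA
print(round(relative_reduction(32.7, 14.7)))  # end-to-end
```

In the example a single substituted character out of seven gives a CER of roughly 14.3. Applied to the reported numbers, LoRA corresponds to a relative reduction of about 36% and end-to-end fine-tuning to about 55% for Whisper-Tiny.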

Paper Structure

This paper contains 10 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Training and evaluation curves for different Rank values of LoRA. We also show the performance of the end-to-end fine-tuning approach. These curves are from fine-tuning the Tiny model.
  • Figure 2: Training and evaluation metrics for Tiny (top row), Base (middle row), and Small (bottom row) models. The columns represent, from left to right: training loss, evaluation loss, training WER, and evaluation CER. For the Tiny and Base models, E2E fine-tuning achieves superior performance. In contrast, the Small model exhibits better convergence with LoRA fine-tuning, likely due to its ability to efficiently adapt large parameter sets while maintaining computational efficiency. Refer to subfigures j, k, and l for details.
  • Figure 3: Visualization of SpecAugment: Original Log-Mel spectrum (Left) vs Augmented Spectrogram with Time and Frequency Masking (Right)
  • Figure 4: Evaluation loss comparison between models trained with and without SpecAugment
  • Figure 5: Transcription samples from various models with their phonetic readings and English translations. The phonetic readings are consistent across all models, but differences in Kanji lead to significant variations in semantic meaning.
  • ...and 1 more figure
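The time and frequency masking shown in Figure 3 can be sketched in a few lines of NumPy. This is an illustrative approximation, not the authors' pipeline: the mask widths, the number of masks, the choice to fill masked regions with the spectrogram mean, and the function name spec_augment are all assumptions, and the time-warping step of the original SpecAugment recipe is omitted.

```python
import numpy as np


def spec_augment(log_mel, rng, max_freq_width=10, max_time_width=20,
                 n_freq_masks=1, n_time_masks=1):
    """Return a copy of a (n_mels, n_frames) log-Mel spectrogram with
    random frequency bands and time spans masked out."""
    out = log_mel.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()  # assumed fill value; zero is another common choice

    for _ in range(n_freq_masks):
        # Mask width is drawn from [0, max width], clamped to the axis length.
        width = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
        start = int(rng.integers(0, n_mels - width + 1))
        out[start:start + width, :] = fill   # mask a band of Mel channels

    for _ in range(n_time_masks):
        width = int(rng.integers(0, min(max_time_width, n_frames) + 1))
        start = int(rng.integers(0, n_frames - width + 1))
        out[:, start:start + width] = fill   # mask a span of frames

    return out


# Illustrative input: 80 Mel bins by 300 frames of random values.
rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((80, 300))
augmented = spec_augment(spectrogram, rng)
print(augmented.shape, int((augmented != spectrogram).sum()), "values masked")
```

Because the masks are redrawn on every call, the model sees a differently occluded version of each utterance at every epoch, which is the regularizing effect compared in Figure 4.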