PhoWhisper: Automatic Speech Recognition for Vietnamese
Thanh-Thien Le, Linh The Nguyen, Dat Quoc Nguyen
TL;DR
PhoWhisper targets robust Vietnamese ASR by fine-tuning the multilingual Whisper model on a diverse 844-hour dataset spanning multiple sources and 26k speakers. The approach yields five model variants (from PhoWhisper-tiny to PhoWhisper-large), trained with noise augmentation to improve real-world robustness. Empirical results show state-of-the-art word error rate (WER) on the CMV-Vi, VIVOS, and VLSP 2020 benchmarks, with PhoWhisper-large achieving the best scores on all datasets. The authors release PhoWhisper publicly to provide a strong, reproducible baseline for Vietnamese ASR research and practical applications.
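Since the results above are reported as WER, a minimal sketch of how that metric is commonly computed may help: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the text normalization used in the paper's evaluation is not described here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is dropped by the hypothesis.
    print(word_error_rate("xin chào các bạn", "xin chào bạn"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.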
Abstract
We introduce PhoWhisper in five versions for Vietnamese automatic speech recognition. PhoWhisper's robustness is achieved through fine-tuning the Whisper model on an 844-hour dataset that encompasses diverse Vietnamese accents. Our experimental study demonstrates the state-of-the-art performance of PhoWhisper on benchmark Vietnamese ASR datasets. We have open-sourced PhoWhisper at: https://github.com/VinAIResearch/PhoWhisper
