Fine-tuning Whisper on Low-Resource Languages for Real-World Applications
Vincenzo Timmel, Claudio Paonessa, Reza Kakooee, Manfred Vogel, Daniel Perruchoud
TL;DR
The paper tackles the challenge of fine-tuning OpenAI's Whisper for low-resource languages by converting sentence-level Swiss German data into realistic long-form audio for robust segmentation and transcription. A novel data-generation pipeline (timestamp correction, noise overlap, and speaker retention) paired with careful training and diverse datasets yields a state-of-the-art Swiss German STT model, while highlighting the importance of long-form data and data diversity to maintain performance across distributions. By augmenting with pseudo-labeled SRG data and German Common Voice, the approach mitigates forgetting and improves generalization, achieving strong results across multiple benchmarks and dialects. The work provides a practical framework and code to extend Whisper fine-tuning to other low-resource languages, with implications for real-world applications like subtitles, doctor-patient transcripts, and conversational AI.
Abstract
This paper presents a new approach to fine-tuning OpenAI's Whisper model for low-resource languages by introducing a novel data generation method that converts sentence-level data into a long-form corpus, using Swiss German as a case study. Non-sentence-level data, which could improve performance on long-form audio, is difficult to obtain and often restricted by copyright laws. Our method bridges this gap by transforming more accessible sentence-level data into a format that preserves the model's ability to handle long-form audio and perform segmentation, without requiring non-sentence-level data. Our data generation process improves performance in several real-world applications and leads to the development of a new state-of-the-art speech-to-text (STT) model for Swiss German. We compare our model with a non-fine-tuned Whisper and with our previous state-of-the-art Swiss German STT models; our new model achieves higher BLEU scores than both. Our results also indicate that the proposed method is adaptable to other low-resource languages. We provide written guidance and code for creating fine-tuned Whisper models that retain their segmentation capabilities and transcribe longer audio files with high quality using only sentence-level data.
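To make the core idea concrete, the following is a minimal sketch of how sentence-level clips might be concatenated into one long-form training sample with Whisper-style timestamp tokens. It is an illustration under stated assumptions, not the authors' pipeline: the names `Clip`, `quantize`, and `build_long_form` are hypothetical, audio is represented as plain Python lists, and the timestamp correction, noisy overlaps, and speaker retention steps mentioned in the TL;DR are not modelled.

```python
from dataclasses import dataclass
from typing import List, Tuple

SAMPLE_RATE = 16_000   # Whisper operates on 16 kHz mono audio
MAX_SECONDS = 30.0     # length of Whisper's input window
TIME_STEP = 0.02       # Whisper timestamp tokens have 20 ms resolution


@dataclass
class Clip:
    """One sentence-level recording (hypothetical container)."""
    samples: List[float]  # mono waveform at SAMPLE_RATE
    text: str             # transcript of the sentence


def quantize(seconds: float) -> float:
    """Snap a time to the nearest 20 ms step used by timestamp tokens."""
    return round(round(seconds / TIME_STEP) * TIME_STEP, 2)


def build_long_form(clips: List[Clip]) -> Tuple[List[float], str]:
    """Concatenate clips until the 30 s window would be exceeded.

    Returns the joined waveform and a target string in which every
    sentence is wrapped in start and end timestamp tokens.
    """
    audio: List[float] = []
    parts: List[str] = []
    for clip in clips:
        start = len(audio) / SAMPLE_RATE
        end = start + len(clip.samples) / SAMPLE_RATE
        if end > MAX_SECONDS:
            break  # the remaining clips would go into the next sample
        audio.extend(clip.samples)
        parts.append(
            f"<|{quantize(start):.2f}|>{clip.text}<|{quantize(end):.2f}|>"
        )
    return audio, "".join(parts)
```

Called with a 2-second clip followed by a 3-second clip, `build_long_form` returns 5 seconds of audio and a target in which the second sentence starts at the timestamp where the first one ends. Training on targets of this shape is what lets a fine-tuned model keep predicting segment boundaries even though the source data contained only isolated sentences.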
