Fine-tuning Whisper on Low-Resource Languages for Real-World Applications

Vincenzo Timmel, Claudio Paonessa, Reza Kakooee, Manfred Vogel, Daniel Perruchoud

TL;DR

The paper tackles the challenge of fine-tuning OpenAI's Whisper for low-resource languages by converting sentence-level Swiss German data into realistic long-form audio for robust segmentation and transcription. A novel data-generation pipeline (timestamp correction, noise overlap, and speaker retention) paired with careful training and diverse datasets yields a state-of-the-art Swiss German STT model, while highlighting the importance of long-form data and data diversity to maintain performance across distributions. By augmenting with pseudo-labeled SRG data and German Common Voice, the approach mitigates forgetting and improves generalization, achieving strong results across multiple benchmarks and dialects. The work provides a practical framework and code to extend Whisper fine-tuning to other low-resource languages, with implications for real-world applications like subtitles, doctor-patient transcripts, and conversational AI.

Abstract

This paper presents a new approach to fine-tuning OpenAI's Whisper model for low-resource languages by introducing a novel data generation method that converts sentence-level data into a long-form corpus, using Swiss German as a case study. Non-sentence-level data, which could improve the performance of long-form audio, is difficult to obtain and often restricted by copyright laws. Our method bridges this gap by transforming more accessible sentence-level data into a format that preserves the model's ability to handle long-form audio and perform segmentation without requiring non-sentence-level data. Our data generation process improves performance in several real-world applications and leads to the development of a new state-of-the-art speech-to-text (STT) model for Swiss German. We compare our model with a non-fine-tuned Whisper and our previous state-of-the-art Swiss German STT models, where our new model achieves higher BLEU scores. Our results also indicate that the proposed method is adaptable to other low-resource languages, supported by written guidance and code that allows the creation of fine-tuned Whisper models, which keep segmentation capabilities and allow the transcription of longer audio files using only sentence-level data with high quality.

Paper Structure

This paper contains 14 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Illustration of generated long-form training data from sentence-level audios. Although timestamps are available via the length of the audio, they are not displayed here.
  • Figure 2: Illustration of the logical structure for stitching together sentences using VAD and overlap mechanisms. With the help of a VAD model, we precisely mark the start and end of speech. This allows us to vary the length of pauses between sentences and even introduce an overlap.
  • Figure 3: BLEU score on the STT4SG-350 test set vs. amount of training data (given in Table \ref{tab:datasets}) used for fine-tuning. The model evaluated at 0 hours of training data corresponds to the original Whisper Large-v2. The SOTA model is discussed in Section \ref{sec:overcome_shortcomings}.
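
The stitching idea described in the caption of Figure 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the `Clip` class, the `stitch` function, the sample-based gap convention, and mixing overlaps by simple summation are all assumptions made for illustration. The speech bounds are assumed to come from some VAD model, which is not shown. Other steps named in the TL;DR (timestamp correction, noise overlap, speaker retention) are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clip:
    """Hypothetical container for one sentence-level recording."""
    samples: List[float]
    speech_start: int  # first sample with speech (assumed to come from a VAD model)
    speech_end: int    # one past the last sample with speech
    text: str


def stitch(
    clips: List[Clip], gaps: List[int]
) -> Tuple[List[float], List[Tuple[int, int, str]]]:
    """Join VAD-trimmed clips into one long-form signal.

    gaps[i] is the distance in samples between the end of clip i and the
    start of clip i+1: positive inserts silence, negative overlaps the two
    clips by summing their samples. Missing gap entries default to 0.
    Returns the audio and (start, end, text) segments in samples.
    """
    audio: List[float] = []
    segments: List[Tuple[int, int, str]] = []
    cursor = 0       # where the next clip would start
    prev_start = 0   # an overlap may not reach before the previous clip
    for i, clip in enumerate(clips):
        # Keep only the region the VAD marked as speech.
        speech = clip.samples[clip.speech_start:clip.speech_end]
        start = max(cursor, prev_start)
        end = start + len(speech)
        # Grow the output with silence so the clip fits.
        if end > len(audio):
            audio.extend([0.0] * (end - len(audio)))
        # Summing handles both the plain case and the overlap case.
        for j, sample in enumerate(speech):
            audio[start + j] += sample
        segments.append((start, end, clip.text))
        gap = gaps[i] if i < len(gaps) else 0
        cursor = end + gap
        prev_start = start
    return audio, segments
```

In a real pipeline the gaps would presumably be drawn at random to vary pause lengths, and the segment boundaries would be divided by the sampling rate to obtain timestamps in seconds for the long-form training targets.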