Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text

Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang

TL;DR

This work improves Whisper's recognition of Kazakh, a low-resource language, using inexpensive unpaired text and speech data. It integrates a GPT language model into Whisper's decoding, modifies the end-of-transcript (EOT) judgment, and adds a hallucination penalty; it then selects unlabeled speech by the average token log probability (ALP) of its decoded output and fine-tunes on the resulting pseudo-labels. The approach yields more than a 10% absolute WER reduction in multiple experiments. GPT integration provides larger gains for smaller Whisper models, and ALP-based sampling enables domain adaptation without manual labeling. The methods have the potential to generalize to other under-represented languages and offer a scalable way to bring language model knowledge into large ASR systems without extensive labeled data.

Abstract

Whisper and other large-scale automatic speech recognition (ASR) models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is therefore worth studying how low-cost data can be used to improve Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh. We implemented an end-of-transcript (EOT) judgment modification and a hallucination penalty to improve recognition performance. Further, we employed the average token log probability of the decoded output as a criterion to select samples from unlabeled speech data, and fine-tuned the model on the resulting pseudo-labeled data to further improve its performance. Ultimately, we achieved more than 10% absolute word error rate (WER) reduction in multiple experiments, and the whole process has the potential to be generalized to other under-represented languages.
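
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hypothesis` container, the function names, and the `proportion` parameter are hypothetical, and the sketch assumes the decoder already returns one log probability per emitted token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """A decoded utterance (hypothetical container, not from the paper)."""
    utt_id: str
    text: str                    # pseudo-label produced by the ASR model
    token_logprobs: List[float]  # one log probability per decoded token


def average_log_prob(token_logprobs: List[float]) -> float:
    """Average token log probability (ALP); higher means more confident."""
    if not token_logprobs:
        # An empty decoding carries no evidence, so rank it last.
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def select_by_alp(hyps: List[Hypothesis], proportion: float) -> List[Hypothesis]:
    """Keep the given proportion of hypotheses with the highest ALP."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be in [0, 1]")
    ranked = sorted(
        hyps,
        key=lambda h: average_log_prob(h.token_logprobs),
        reverse=True,
    )
    keep = int(len(ranked) * proportion)
    return ranked[:keep]


if __name__ == "__main__":
    decoded = [
        Hypothesis("utt1", "pseudo label one", [-0.1, -0.2, -0.1]),
        Hypothesis("utt2", "pseudo label two", [-1.5, -2.0, -1.0]),
        Hypothesis("utt3", "pseudo label three", [-0.4, -0.3]),
        Hypothesis("utt4", "pseudo label four", [-0.9, -0.8, -1.1]),
    ]
    for hyp in select_by_alp(decoded, proportion=0.5):
        print(hyp.utt_id, round(average_log_prob(hyp.token_logprobs), 3))
```

The selected hypotheses would then serve as pseudo-labeled pairs for fine-tuning; how large a proportion to keep is a tuning choice that this sketch leaves open.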

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Integrating GPT into the decoding process of Whisper.
  • Figure 2: Decoded sample distribution of models on Fleurs-test. The X-axis represents the negative average log probability of the sample's tokens (-ALP), and the Y-axis represents the Word Error Rate (WER) for each sample. The red dashed line separates the samples into two halves based on -ALP.
  • Figure 3: Relationship between the proportion of data selected based on the average token log probability and the WER of the corresponding domain test set.
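
The per-sample analysis behind Figure 2 can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: it assumes whitespace-tokenized words for WER, and it assumes the dividing line sits at the median -ALP, which the caption's "two halves" suggests but does not state. The function names are hypothetical.

```python
from statistics import median
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Per-sample WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def split_by_neg_alp(
    samples: List[Tuple[float, float]],
) -> Tuple[float, List[float], List[float]]:
    """Split (neg_alp, wer) pairs at the median -ALP (assumed threshold).

    Returns the threshold, the WERs of the confident half (low -ALP),
    and the WERs of the less confident half (high -ALP).
    """
    threshold = median(neg_alp for neg_alp, _ in samples)
    confident = [wer for neg_alp, wer in samples if neg_alp <= threshold]
    uncertain = [wer for neg_alp, wer in samples if neg_alp > threshold]
    return threshold, confident, uncertain


if __name__ == "__main__":
    # Toy values for illustration only, not results from the paper.
    toy = [(0.1, 0.0), (0.2, 0.1), (0.5, 0.4), (0.9, 0.8)]
    threshold, confident, uncertain = split_by_neg_alp(toy)
    print("threshold:", threshold)
    print("mean WER, low -ALP half:", sum(confident) / len(confident))
    print("mean WER, high -ALP half:", sum(uncertain) / len(uncertain))
```

Comparing the mean WER of the two halves is one simple way to check whether ALP tracks recognition quality on a labeled test set before relying on it to filter unlabeled speech.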