Enhancing Whisper's Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization

Kumud Tripathi, Raj Gothi, Pankaj Wasnik

TL;DR

This work tackles the low-resource challenge of Whisper-based ASR for Indian languages by introducing two complementary techniques: language-family prompt-tuning and a custom Indian-language tokenizer. The former leverages linguistic similarities within the Indo-Aryan and Dravidian families via prompts added to the decoder, while the latter expands the token set with language-specific BPE tokens to boost decoding efficiency. Experiments on Kathbath (IndicSUPERB) across eight languages show that language-family prompting achieves the best WER improvements, the tokenizer yields the fastest inference, and combining both offers a favorable balance between accuracy and speed. The results demonstrate a practical pathway to making multilingual ASR more accurate and real-time capable for diverse Indian languages, with potential applicability to additional low-resource languages and dialects.
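To make the language-family prompt concrete, the following is a minimal sketch of how a family tag could be placed in Whisper's decoder prompt. The standard special tokens (`<|startoftranscript|>`, the language tag, `<|transcribe|>`, `<|notimestamps|>`) are Whisper's own; the family tokens `<|indo-aryan|>` and `<|dravidian|>`, their position in the sequence, the helper name `build_decoder_prompt`, and the eight languages in the lookup table are all assumptions made for illustration, since this summary does not specify them.

```python
# Hypothetical language-to-family lookup. The grouping itself is standard
# linguistics, but which eight languages the paper evaluates is an assumption.
LANGUAGE_FAMILY = {
    "hi": "indo-aryan",  # Hindi
    "bn": "indo-aryan",  # Bengali
    "mr": "indo-aryan",  # Marathi
    "gu": "indo-aryan",  # Gujarati
    "ta": "dravidian",   # Tamil
    "te": "dravidian",   # Telugu
    "kn": "dravidian",   # Kannada
    "ml": "dravidian",   # Malayalam
}


def build_decoder_prompt(lang: str) -> list[str]:
    """Return a decoder prompt with an (assumed) family token before the language tag."""
    family = LANGUAGE_FAMILY[lang]
    return [
        "<|startoftranscript|>",
        f"<|{family}|>",  # hypothetical family token, not part of stock Whisper
        f"<|{lang}|>",
        "<|transcribe|>",
        "<|notimestamps|>",
    ]


if __name__ == "__main__":
    print(build_decoder_prompt("ta"))
    print(build_decoder_prompt("hi"))
```

In this sketch, languages of the same family share one prompt token, which is one plausible way for related languages to share what is learned during prompt-tuning.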

Abstract

Automatic speech recognition has recently seen a significant advancement with large foundational models such as Whisper. However, these models often struggle to perform well in low-resource languages, such as Indian languages. This paper explores two novel approaches to enhance Whisper's multilingual speech recognition performance in Indian languages. First, we propose prompt-tuning with language family information, which enhances Whisper's accuracy in linguistically similar languages. Second, we introduce a novel tokenizer that reduces the number of generated tokens, thereby accelerating Whisper's inference speed. Our extensive experiments demonstrate that the tokenizer significantly reduces inference time, while prompt-tuning enhances accuracy across various Whisper model sizes, including Small, Medium, and Large. Together, these techniques achieve a balance between optimal WER and inference speed.
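The link between token count and inference speed can be illustrated with a toy example: an autoregressive decoder runs one step per output token, so a vocabulary that covers Indic text with fewer, longer tokens needs fewer steps. The sketch below is not the paper's tokenizer and is not real BPE; it uses greedy longest-match with a UTF-8 byte fallback as a simple stand-in, and the function name `tokenize` and the two added subword pieces are hypothetical.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization with per-byte fallback (a stand-in for BPE)."""
    tokens: list[str] = []
    max_len = max((len(piece) for piece in vocab), default=0)
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that fits at position i.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # No piece matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


WORD = "नमस्ते"  # a Hindi greeting in Devanagari script

# Baseline: no Indic pieces in the vocabulary, so every character falls back to bytes.
baseline = tokenize(WORD, set())

# Extended: two hypothetical language-specific pieces that together cover the word.
extended = tokenize(WORD, {WORD[:2], WORD[2:]})

if __name__ == "__main__":
    print(len(baseline), "tokens without Indic pieces")
    print(len(extended), "tokens with Indic pieces")
```

Each token saved is one fewer decoder step, which is the mechanism behind the speed-up the abstract attributes to the tokenizer. The actual savings depend on the real vocabulary and text, which this toy does not model.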

Paper Structure (9 sections, 1 figure, 5 tables)

This paper contains 9 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Block diagram of the proposed architecture. Yellow color represents the proposed approaches.