Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper

Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee

TL;DR

Stronger adherence to the topic information in textual prompts does not guarantee better performance, and Whisper appears to be aware of misleading information in language tokens: it ignores incorrect language tokens and focuses on the correct ones.

Abstract

This research explores how the information in prompts interacts with the high-performing speech recognition model Whisper. We compare its performance when given prompts with correct information and prompts corrupted with incorrect information. Our results unexpectedly show that Whisper may not understand textual prompts in the way humans expect. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. English prompts also generally outperform Mandarin ones on datasets of both languages, likely due to differences in the training data distributions for these languages, despite the mismatch with pre-training scenarios. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by ignoring incorrect language tokens and focusing on the correct ones. In sum, we raise insightful questions about Whisper's prompt understanding and reveal its counter-intuitive behaviors. We encourage further studies.

Paper Structure (27 sections, 3 figures, 4 tables)

This paper contains 27 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of assessing the understanding of textual prompts and language tokens. Matched prompts are first generated and then corrupted into mismatched prompts by changing the topic information, the prompt language (left), or one of the language tokens (right) into a mismatched one. Whisper's performance with the matched and mismatched prompts is compared using the metrics defined in the paper's metrics section.
  • Figure 2: Illustration of PERF, BPERF, and TFR. Red squares mark the minimum WER on each subset across all prompts. Green stars mark the WER on each subset when prompted by matched prompts.
  • Figure 3: Linear regression of the PERF/BPERF of templates on the corresponding TFR. Each point represents the PERF/BPERF and TFR obtained with a specific template.
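
The Figure 2 caption refers to two quantities per subset: the minimum WER across all prompts and the WER obtained with the matched prompt. A minimal Python sketch of how they could be computed is given below. It assumes the standard word-level edit-distance definition of WER. The function names, the prompt labels, and the WER values in the usage example are hypothetical. The definitions of PERF, BPERF, and TFR are not reproduced in this excerpt, so the sketch does not attempt to implement them.

```python
from typing import Dict, Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def summarize_subset(wer_by_prompt: Dict[str, float], matched_prompt: str) -> Dict[str, object]:
    """Return the minimum WER across prompts and the WER of the matched prompt.

    Hypothetical helper: it mirrors the two markers described in the
    Figure 2 caption, not the paper's PERF/BPERF/TFR definitions.
    """
    best_prompt = min(wer_by_prompt, key=wer_by_prompt.get)
    matched_wer = wer_by_prompt[matched_prompt]
    return {
        "min_wer": wer_by_prompt[best_prompt],
        "best_prompt": best_prompt,
        "matched_wer": matched_wer,
        "matched_is_best": matched_wer == wer_by_prompt[best_prompt],
    }


if __name__ == "__main__":
    # Made-up WER values for one subset whose true topic is "sports".
    subset = {"sports": 0.12, "music": 0.10, "news": 0.15}
    print(summarize_subset(subset, matched_prompt="sports"))
```

In this made-up subset the matched prompt does not give the lowest WER, which is the kind of counter-intuitive outcome the abstract describes.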