Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet

TL;DR

Whisper-Medusa extends OpenAI's Whisper architecture to predict multiple tokens per iteration, reducing latency by 50% with minimal impact on Word Error Rate (WER).

Abstract

Large transformer-based models have significant potential for speech transcription and translation. Their self-attention mechanisms and parallel processing enable them to capture complex patterns and dependencies in audio sequences. However, this potential comes with challenges, as these large and computationally intensive models lead to slow inference speeds. Various optimization strategies have been proposed to improve performance, including efficient hardware utilization and algorithmic enhancements. In this paper, we introduce Whisper-Medusa, a novel approach designed to enhance processing speed with minimal impact on Word Error Rate (WER). The proposed model extends OpenAI's Whisper architecture by predicting multiple tokens per iteration, resulting in a 50% reduction in latency. We showcase the effectiveness of Whisper-Medusa across different learning setups and datasets.

Paper Structure (10 sections, 3 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Whisper-Medusa architectures. Left, Medusa-Linear: each Medusa head consists of a single linear layer with a residual connection, followed by a shared vocabulary projection layer (indicated by a chain symbol). Right, Medusa-Block: a Whisper decoder block shared across all Medusa heads, followed by a single linear layer and a residual connection for each head, with the outputs then passed to a shared vocabulary projection layer (indicated by a chain symbol).
  • Figure 2: Average Speedup Results by Target Token Sequence Length for Czech, Finnish, and Dutch with the Medusa-Linear and Medusa-Block Models.
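The Medusa-Linear variant described in the Figure 1 caption can be illustrated with a minimal sketch in plain Python. It shows only the structure the caption names: each head applies one linear layer with a residual connection, and all heads share one vocabulary projection. The function names, the bias term, the toy dimensions, the greedy argmax selection, and the absence of any nonlinearity are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def medusa_linear_head(hidden: Vector, weight: Matrix, bias: Vector) -> Vector:
    """One head: a single linear layer plus a residual connection.

    The bias term is an assumption; the caption only mentions a linear layer.
    """
    projected = matvec(weight, hidden)
    return [h + p + b for h, p, b in zip(hidden, projected, bias)]


def predict_tokens(
    hidden: Vector, heads: List[Tuple[Matrix, Vector]], vocab_proj: Matrix
) -> List[int]:
    """Return one greedy token id per head, using a shared vocabulary projection."""
    tokens = []
    for weight, bias in heads:
        head_state = medusa_linear_head(hidden, weight, bias)
        logits = matvec(vocab_proj, head_state)  # same projection for every head
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


# Toy sizes (hypothetical): hidden dimension 2, vocabulary of 3 tokens, 2 heads.
hidden = [1.0, 0.0]
vocab_proj = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
heads = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),   # zero weights: residual passes hidden through
    ([[-1.0, 0.0], [1.0, 0.0]], [0.0, 0.0]),  # moves the state from [1, 0] to [0, 1]
]
tokens = predict_tokens(hidden, heads, vocab_proj)
print(tokens)
```

Each entry of `tokens` stands for a candidate at a different future position, which is how several tokens can be proposed from a single decoder state in one iteration. How those candidates are verified or accepted is not covered by the caption and is left out of this sketch.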