
Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion

Sen Zhang, Jianguo Wei, Wenhuan Lu, Xianghu Yue, Wei Li, Qiang Li, Pengcheng Zhao, Ming Cai, Luo Si

TL;DR

Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency, and the proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning.

Abstract

The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism consumes significant GPU memory because the Key-Value (KV) cache grows linearly with sequence length, which is problematic for many applications, especially those involving long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA to Whisper's absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows a pretrained Whisper model to be converted to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
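
To make the memory claim concrete, the following is a minimal back-of-the-envelope sketch of how a KV cache scales under MHA and under a latent cache. The model sizes, the helper name `kv_cache_bytes`, and the choice of one shared latent of width `d_model // 4` per token are illustrative assumptions, not values taken from the paper; they are chosen only because they produce a 1/8 ratio, matching the "up to 87.5%" figure quoted in the abstract.

```python
def kv_cache_bytes(num_layers, batch_size, seq_len, per_token_width, bytes_per_value=2):
    """Bytes cached across all layers, assuming fp16 storage by default."""
    return num_layers * batch_size * seq_len * per_token_width * bytes_per_value


# Hypothetical configuration (illustrative values, not taken from the paper).
d_model = 1024
num_layers = 24
batch_size = 8
seq_len = 448

# MHA caches one key vector and one value vector of width d_model per token.
mha = kv_cache_bytes(num_layers, batch_size, seq_len, 2 * d_model)

# Assumed latent cache: a single shared vector of width d_model // 4 per token.
latent_dim = d_model // 4
mla = kv_cache_bytes(num_layers, batch_size, seq_len, latent_dim)

reduction = 1 - mla / mha
print(f"MHA cache: {mha / 2**20:.1f} MiB")
print(f"MLA cache: {mla / 2**20:.1f} MiB")
print(f"Reduction: {reduction:.1%}")
```

Both caches grow linearly with `seq_len` and `batch_size`; under these assumptions only the per-token width changes, so the relative saving is the same at every sequence length.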

Paper Structure

This paper contains 9 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Three attention architectures: (a) Original MHA in Whisper (left), (b) Full low-rank compression MLA (middle), (c) Dimension-preserving MLA (right).
  • Figure 2: The method of converting Whisper to Whisper-MLA.
  • Figure 3: Comparison of GPU memory consumption between Whisper and Whisper-MLA across different batch sizes and sequence lengths.
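
Figure 2 concerns converting a pretrained Whisper model to Whisper-MLA, but the figure itself is not reproduced here. As a rough illustration of the general idea of initialising a latent attention layer from pretrained MHA weights, the sketch below factorises stacked key/value projection matrices with a truncated SVD. This is an assumption for illustration only, not necessarily the paper's procedure: the random matrices stand in for pretrained weights, and the names `W_down`, `W_up`, and `latent_dim` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, latent_dim = 64, 16  # hypothetical sizes

# Random stand-ins for pretrained key and value projections (d_model x d_model).
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

# Stack the projections so keys and values share one latent: x @ W_kv = [k, v].
W_kv = np.concatenate([W_k, W_v], axis=1)  # (d_model, 2 * d_model)

# Truncated SVD gives the best rank-latent_dim approximation in Frobenius norm.
U, S, Vt = np.linalg.svd(W_kv, full_matrices=False)
W_down = U[:, :latent_dim] * S[:latent_dim]  # (d_model, latent_dim)
W_up = Vt[:latent_dim, :]                    # (latent_dim, 2 * d_model)

x = rng.standard_normal((5, d_model))  # hidden states for 5 tokens
latent = x @ W_down                    # the only tensor that would be cached
k_approx, v_approx = np.split(latent @ W_up, 2, axis=1)

rel_err = np.linalg.norm(W_kv - W_down @ W_up) / np.linalg.norm(W_kv)
print(latent.shape, k_approx.shape, v_approx.shape)
print(f"relative approximation error: {rel_err:.3f}")
```

Random matrices are close to full rank, so the approximation error printed here is large. The factorisation only provides a starting point, which is consistent with the abstract's statement that some fine-tuning is still required after conversion.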