Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models

Dominik Wagner, Ilja Baumann, Korbinian Riedhammer, Tobias Bocklet

TL;DR

The paper addresses the degradation that post-training quantization (PTQ) causes in large speech foundation models when weights and activations contain outliers. It combines knowledge distillation, which produces compact Whisper student models, with gated attention, which mitigates outliers and enables reliable INT8 quantization. The approach makes word error rate (WER) substantially more resilient to quantization, particularly for a student with a 24-layer encoder and gating, and improves outlier statistics (e.g., lower kurtosis and a smaller infinity norm $\|\cdot\|_{\infty}$) relative to ungated baselines. By linking outlier mitigation in attention to robust PTQ performance, the work supports practical deployment of efficient, quantized speech foundation models on devices with limited compute and memory.

Abstract

This paper explores the improvement of post-training quantization (PTQ) after knowledge distillation in the Whisper speech foundation model family. We address the challenge of outliers in weights and activation tensors, known to impede quantization quality in transformer-based language and vision models. Extending this observation to Whisper, we demonstrate that these outliers are also present when transformer-based models are trained to perform automatic speech recognition, necessitating mitigation strategies for PTQ. We show that outliers can be reduced by a recently proposed gating mechanism in the attention blocks of the student model, enabling effective 8-bit quantization, and lower word error rates compared to student models without the gating mechanism in place.
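
To make the link between outliers and quantization quality concrete, the following is a minimal sketch and not the authors' implementation. It assumes symmetric per-tensor INT8 quantization, a synthetic Gaussian tensor, and a single hypothetical outlier of magnitude 60; the helper names `quantize_int8`, `kurtosis`, and `mean_abs_error` are illustrative. It prints the infinity norm and kurtosis mentioned in the TL;DR alongside the quantization scale and the rounding error on the non-outlier values.

```python
import random
import statistics


def quantize_int8(values):
    """Symmetric per-tensor INT8 (assumed scheme): the scale is set by the
    largest magnitude, so one outlier stretches the step size for all values."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return [qi * scale for qi in q], scale


def kurtosis(values):
    """Fourth standardized moment (about 3 for a Gaussian)."""
    mean = statistics.fmean(values)
    var = statistics.fmean((v - mean) ** 2 for v in values)
    return statistics.fmean((v - mean) ** 4 for v in values) / var ** 2


def mean_abs_error(values, dequantized):
    """Average absolute difference between original and dequantized values."""
    return statistics.fmean(abs(a - b) for a, b in zip(values, dequantized))


random.seed(0)
clean = [random.gauss(0.0, 1.0) for _ in range(4096)]
with_outlier = clean[:-1] + [60.0]  # one hypothetical outlier activation

for name, x in [("clean", clean), ("with outlier", with_outlier)]:
    deq, scale = quantize_int8(x)
    # Compare the error only on the values both tensors share.
    err = mean_abs_error(x[:4095], deq[:4095])
    inf_norm = max(abs(v) for v in x)
    print(f"{name}: inf-norm={inf_norm:.2f} kurtosis={kurtosis(x):.2f} "
          f"scale={scale:.4f} err={err:.4f}")
```

In this toy setting the single outlier raises both statistics and enlarges the quantization step, so the remaining values are rounded more coarsely. This is the kind of effect that outlier mitigation is meant to limit.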

Paper Structure (14 sections, 5 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Behavior of the self-attention mechanism in the pretrained Whisper model (whisper-large-v2), computed for the first example of the LibriSpeech test-clean set. The left matrix shows the attention probabilities $\mathbf{P}$, where $d=64$ is the dimensionality of the attention head. The middle matrix shows the values $\mathbf{V}$ in the twelfth attention head. The right matrix is the product of the two.
  • Figure 2: Top 10 shares of activation outliers per hidden dimension at the output projection of the self-attention block in the last decoder layer of the trained student models. The left panel shows the relative outliers for a student trained without gated attention; the right panel shows them for a student trained with gated attention. Both models use a 24-layer encoder. The hidden dimensions are zero-indexed, and each output projection layer has a dimensionality of 1280.
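
The per-dimension outlier shares described in the Figure 2 caption could be computed roughly as follows. This is a sketch under stated assumptions: the source does not give its outlier criterion, so the threshold of six standard deviations from the tensor mean (`num_std`) is hypothetical, as are the function name `outlier_shares` and the toy data.

```python
import random
import statistics
from collections import Counter


def outlier_shares(activations, num_std=6.0, top_k=10):
    """Return (dimension, share) pairs for the top_k dimensions by outlier count.

    activations: list of rows, each a list of hidden-dimension values.
    num_std: assumed threshold; the source does not state its criterion.
    """
    flat = [v for row in activations for v in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    counts = Counter()
    for row in activations:
        for dim, v in enumerate(row):  # dimensions are zero-indexed
            if abs(v - mean) > num_std * std:
                counts[dim] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(dim, n / total) for dim, n in counts.most_common(top_k)]


random.seed(0)
hidden = 32  # toy size; the layers in the caption have 1280 dimensions
acts = [[random.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(200)]
for row in acts[::4]:
    row[7] = 40.0  # plant outliers in one hypothetical dimension

print(outlier_shares(acts))
```

In the toy data every outlier is planted in dimension 7, so that dimension receives the whole share. A flatter distribution of shares would indicate that outliers are not concentrated in a few dimensions.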