Quantizing Whisper-small: How design choices affect ASR performance

Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal

TL;DR

The paper investigates post-training quantization (PTQ) for Whisper-small to enable edge deployment. It conducts a cross-library, cross-method study across four PTQ libraries to disentangle the effects of scheme, granularity, and bit-width on ASR performance using LibriSpeech data. Dynamic int8 quantization emerges as the most reliable option, with Quanto dynamic int8 on GPUs offering substantial model-size reductions (about 57%) while preserving or improving accuracy, and CPU deployments favoring fast inference with PyTorch dynamic int8. These findings provide practical guidance for deploying Whisper-small on constrained hardware without retraining, highlighting that 8-bit precision is a safe baseline and that aggressive low-bit formats should be reserved for memory-constrained scenarios or selective layers.
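To make "granularity" and "bit-width" concrete, the following is a minimal sketch of symmetric, weights-only integer quantization in plain Python. It is not the implementation of any of the four libraries in the study; the helper names (`int_scale`, `fake_quantize`, `max_abs_error`) and the toy weight matrix are assumptions chosen for illustration. It compares one scale shared by the whole tensor against one scale per row, and shows how a 3-bit grid can erase small weights entirely.

```python
def int_scale(values, bits=8):
    """Scale that maps the largest magnitude onto the signed integer range."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 3 for int3
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(values, scale, bits=8):
    """Round to the integer grid defined by `scale`, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_abs_error(original, restored):
    """Largest absolute difference between original and restored values."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Toy weight matrix (hypothetical): one small-magnitude row, one large-magnitude row.
weights = [[0.01, -0.02, 0.015], [1.5, -2.0, 0.7]]
flat = [v for row in weights for v in row]

# Per-tensor granularity: a single scale for every weight.
per_tensor = [fake_quantize(row, int_scale(flat)) for row in weights]

# Per-row granularity: each row gets its own scale.
per_row = [fake_quantize(row, int_scale(row)) for row in weights]

print("row 0 error, per-tensor:", max_abs_error(weights[0], per_tensor[0]))
print("row 0 error, per-row:   ", max_abs_error(weights[0], per_row[0]))

# With a 3-bit per-tensor grid, the small row collapses to zeros.
print("row 0 at int3:", fake_quantize(weights[0], int_scale(flat, bits=3), bits=3))
```

In this toy case the small-magnitude row is reconstructed far more accurately with its own scale, which is the intuition behind finer granularity. Formats such as nf4 use a non-uniform grid and are not covered by this sketch.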

Abstract

Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
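The results above are reported as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch using the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is an assumption for illustration, and the sketch applies no text normalization; the paper's actual scoring pipeline may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Comparing this value for a quantized model against the unquantized baseline on the same test set gives the kind of accuracy comparison the abstract describes.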

Paper Structure

This paper contains 12 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: WER increase on test-other relative to test-clean for selected quantized models. Lower-bit-width configurations (nf4, int3) show larger deltas, highlighting the trade-off between compression and robustness.