Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision

Zhaoqing Li, Haoning Xu, Zengrui Jin, Lingwei Meng, Tianzi Wang, Huimeng Wang, Youjun Chen, Mingyu Cui, Shujie Hu, Xunying Liu

TL;DR

This work tackles the challenge of deploying Conformer-based ASR systems under extreme memory and compute constraints by pursuing 2-bit and 1-bit weight quantization. It proposes a quantization-aware training framework that combines tensor-wise learnable scaling, quantization co-training, KL-divergence regularization, and stochastic precision to bridge the performance gap between ultra-low-bit and full-precision models. The authors demonstrate lossless quantization for both 2-bit and 1-bit Conformer models on Switchboard and LibriSpeech, achieving up to 16.2x–16.6x overall compression while maintaining statistically indistinguishable WER from the full-precision baselines. The approach shares weights across bit-widths, requires only negligible extra quantization parameters, and outperforms existing ASR quantization methods, offering practical impact for resource-constrained deployment and potential applicability to other quantization settings.

Abstract

Model compression has become an emerging need as the sizes of modern speech systems rapidly increase. In this paper, we study model weight quantization, which directly reduces the memory footprint to accommodate computationally resource-constrained applications. We propose novel approaches to perform extremely low-bit (i.e., 2-bit and 1-bit) quantization of Conformer automatic speech recognition systems using multi-precision model co-training, stochastic precision, and tensor-wise learnable scaling factors to alleviate quantization-induced performance loss. The proposed methods can achieve performance-lossless 2-bit and 1-bit quantization of Conformer ASR systems trained on the 300-hr Switchboard and 960-hr LibriSpeech corpora. Maximum overall performance-lossless compression ratios of 16.2 and 16.6 times are achieved without a statistically significant increase in the word error rate (WER) over the full-precision baseline systems, respectively.
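To make the idea of a tensor-wise scaling factor concrete, here is a minimal NumPy sketch of mapping a weight tensor onto a 1-bit or 2-bit grid with a single scale per tensor. The function name `quantize_weights`, the 2-bit integer levels {-2, -1, 0, 1}, and the fixed scale value are illustrative assumptions; the source does not specify the paper's exact quantization grid, and in the paper the scale is learned during training rather than set by hand.

```python
import numpy as np


def quantize_weights(w, scale, bits):
    """Map a weight tensor onto a low-bit grid using one scale for the whole tensor.

    Hypothetical helper: the grid below is an assumed symmetric-style layout,
    not necessarily the one used in the paper.
    """
    if bits == 1:
        # Binarization: every weight becomes -scale or +scale.
        return scale * np.where(w >= 0, 1.0, -1.0)
    # Assumed signed integer grid, e.g. {-2, -1, 0, 1} for bits=2.
    q_min = -(2 ** (bits - 1))
    q_max = 2 ** (bits - 1) - 1
    return scale * np.clip(np.round(w / scale), q_min, q_max)


w = np.array([-0.9, -0.2, 0.1, 0.7])
w_1bit = quantize_weights(w, scale=0.5, bits=1)
w_2bit = quantize_weights(w, scale=0.5, bits=2)
```

With this layout, a 1-bit tensor takes at most two distinct values and a 2-bit tensor at most four, so each weight can be stored as a 1-bit or 2-bit index plus one shared floating-point scale per tensor.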

Paper Structure

This paper contains 14 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Diagram of the quantization co-training framework with stochastic precision. During training, for each utterance, the 1-bit model is binarized from the 2-bit quantized model. The stochastic precision model is sampled by randomly binarizing a subset of the layers of the 2-bit model. The three models share weight parameters. The training signal for the 1-bit or stochastic precision model mixes the model loss and the KL-divergence regularization, which are calculated from the real transcripts and the logits of the 2-bit model, respectively.
  • Figure 2: Accuracy ($\uparrow$) of different 1-bit (a) and 2-bit (b) Conformer systems on the validation set as a function of training epochs. "int1" and "int2" represent naively quantized 1-bit and 2-bit Conformer systems, respectively. "CNN$_f$" denotes keeping the convolution modules in full precision. "Sc." denotes using learnable scaling factors. "Co-T.", "KL", and "SP" stand for co-training, KL regularization, and stochastic precision, respectively. The results are plotted every 5 epochs during training. The ID numbers that follow each label correspond to the results in Tables 1 and 2.
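
The training signal described in the Figure 1 caption can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation: the helper names, the per-layer sampling probability `binarize_prob`, the interpolation weight `kl_weight`, and the use of plain cross-entropy as a stand-in for the ASR model loss are all hypothetical.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(logits, targets):
    # Stand-in for the model loss computed from the real transcripts.
    log_q = log_softmax(logits)
    return float(-log_q[np.arange(len(targets)), targets].mean())


def kl_from_teacher(teacher_logits, student_logits):
    # KL(teacher || student), averaged over rows; the teacher is the 2-bit model.
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())


def sample_layer_bits(num_layers, binarize_prob, rng):
    # Stochastic precision: each layer is independently binarized (1-bit)
    # with probability binarize_prob, otherwise it stays at 2-bit.
    return np.where(rng.random(num_layers) < binarize_prob, 1, 2)


def low_bit_loss(student_logits, teacher_logits, targets, kl_weight=0.5):
    # Assumed mixing rule: linear interpolation of model loss and KL term.
    ce = cross_entropy(student_logits, targets)
    kl = kl_from_teacher(teacher_logits, student_logits)
    return (1.0 - kl_weight) * ce + kl_weight * kl


rng = np.random.default_rng(0)
layer_bits = sample_layer_bits(num_layers=12, binarize_prob=0.5, rng=rng)

teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])  # 2-bit model
student_logits = teacher_logits[:, ::-1].copy()  # 1-bit or stochastic model
targets = np.array([0, 2])
loss = low_bit_loss(student_logits, teacher_logits, targets, kl_weight=0.5)
```

In a real training loop the three models would share one set of weights, with `layer_bits` deciding per utterance which layers of the shared 2-bit model are binarized before the forward pass.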