Evolution Strategy-Based Calibration for Low-Bit Quantization of Speech Models

Lucas Rakotoarivony

Abstract

Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, leading to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it using a two-step local-global scheme driven by an evolution strategy. ESC enables unaltered performance under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with post-training quantization (PTQ) methods further reduces performance loss, limiting the relative accuracy degradation on the AST model to 1%.

Paper Structure (14 sections, 5 equations, 2 figures, 3 tables)


Figures (2)

  • Figure 1: Illustration of quantization behavior across audio (Conformer; Gulati et al., 2020), vision (ResNet; He et al., 2016), and NLP (BERT; Devlin et al., 2019) models. Left: Cumulative distribution of normalized activation values, showing an approximately uniform distribution for ResNet, a rapidly saturating distribution for BERT, and a highly compressed distribution for Conformer. Right: Relative performance under weight and activation quantization using max calibration. While all models maintain good performance with 4-bit weight quantization, 4-bit activation quantization severely degrades performance for Conformer, unlike ResNet and BERT.
  • Figure 2: Overview of the proposed ESC method. First, each layer-wise activation scaling factor is locally optimized by minimizing the MSE between the FP32 and quantized layer outputs. Then, all scaling factors are jointly refined using the CMA-ES algorithm to minimize the task-specific error between the quantized model output $\hat{y}$ and the target $y$.
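
The two captions above can be illustrated with a small, self-contained sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration only: the "model" is a chain of two scalar linear layers, the activations are synthetic (mostly small values plus two large outliers, to mimic a wide calibration range), the target $y$ is taken to be the full-precision output, and a simple elitist evolution strategy with log-normal mutations stands in for CMA-ES. All function names (`quantize`, `max_scale`, `local_scale`, `es_refine`, and so on) are hypothetical.

```python
import math
import random

def quantize(x, scale, bits=4):
    """Symmetric uniform fake-quantization of one value: round, clip, rescale."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale

def max_scale(xs, bits=4):
    """Max calibration: map the largest observed magnitude to the top integer level."""
    return max(abs(x) for x in xs) / (2 ** (bits - 1) - 1)

def layer_mse(xs, scale, weight, bits=4):
    """MSE between full-precision and quantized outputs of a toy layer y = weight * x."""
    return sum((weight * x - weight * quantize(x, scale, bits)) ** 2 for x in xs) / len(xs)

def local_scale(xs, weight, bits=4, steps=100):
    """Step 1 (local): grid-search fractions of the max-calibration scale per layer."""
    s_max = max_scale(xs, bits)
    grid = [s_max * (k / steps) for k in range(1, steps + 1)]  # last entry is s_max itself
    return min(grid, key=lambda s: layer_mse(xs, s, weight, bits))

def forward(x, scales, weights, bits=4):
    """Toy quantized model: each layer quantizes its input, then applies a scalar weight."""
    for s, w in zip(scales, weights):
        x = w * quantize(x, s, bits)
    return x

def task_error(scales, inputs, targets, weights, bits=4):
    """Error between the quantized model output and the target, averaged over the data."""
    return sum((forward(x, scales, weights, bits) - y) ** 2
               for x, y in zip(inputs, targets)) / len(inputs)

def es_refine(scales, loss, generations=20, population=12, sigma=0.1, seed=0):
    """Step 2 (global): elitist evolution strategy over all scales jointly.

    This is a simplified stand-in for CMA-ES: no covariance or step-size adaptation.
    """
    rng = random.Random(seed)
    best, best_loss = list(scales), loss(scales)
    for _ in range(generations):
        # Log-normal mutations keep every candidate scale strictly positive.
        children = [[s * math.exp(sigma * rng.gauss(0.0, 1.0)) for s in best]
                    for _ in range(population)]
        child = min(children, key=loss)
        child_loss = loss(child)
        if child_loss < best_loss:  # keep the parent unless a child improves on it
            best, best_loss = child, child_loss
    return best, best_loss

# Synthetic calibration data (assumption): small activations plus two large outliers.
rng = random.Random(0)
inputs = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [20.0, -20.0]
weights = [0.5, 2.0]
hidden = [weights[0] * x for x in inputs]            # full-precision input to layer 2
targets = [weights[1] * h for h in hidden]           # full-precision output used as target

def loss(scales):
    return task_error(scales, inputs, targets, weights)

activations = [inputs, hidden]
scales_max = [max_scale(a) for a in activations]
scales_local = [local_scale(a, w) for a, w in zip(activations, weights)]
scales_es, err_es = es_refine(scales_local, loss)
err_max, err_local = loss(scales_max), loss(scales_local)

print("task error  max / local / ES:", err_max, err_local, err_es)
```

By construction, the local search never yields a higher per-layer MSE than max calibration (the max-calibration scale is one of the grid points), and the elitist refinement never yields a higher task error than its starting point. How large the gains are depends entirely on the synthetic data and is not indicative of the results reported in the paper.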