A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization

Edward Fish; Umberto Michieli; Mete Ozay

TL;DR

The paper tackles the problem of deploying large ASR transformers on devices with limited memory by introducing myQASR, a label-free and personalized mixed-precision quantization framework. It combines a fast, layer-wise sensitivity analysis based on median activation statistics with a uniformity constraint to allocate per-layer bit depths under a target memory budget, followed by calibration of weights and activations using three scaling strategies. The approach requires only a handful of unlabelled samples from the target user and avoids fine-tuning, enabling on-device deployment while preserving accuracy for gender, language, and speaker-specific targets. Experimental results on Wav2Vec2 and Whisper across LibriSpeech, FLEURS, and GSC demonstrate meaningful improvements over standard uniform quantization, with strong gains when calibrations align with the target user’s data distribution, suggesting practical impact for inclusive and private, personalized ASR on mobile and edge devices.
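
The allocation step summarized above can be illustrated with a minimal sketch. It is not the authors' algorithm. It assumes that a layer's sensitivity is the median absolute value of its full-precision activations, that layers with a lower median tolerate fewer bits, and that the uniformity constraint means per-layer bit depths never differ by more than one. The function names, the 8-bit starting point, and the 2-bit floor are hypothetical choices made for illustration.

```python
from statistics import median


def layer_sensitivity(activations):
    """Median absolute activation per layer (an assumed sensitivity proxy)."""
    return [median(abs(a) for a in acts) for acts in activations]


def allocate_bits(sensitivities, param_counts, budget_bits,
                  start_bits=8, min_bits=2):
    """Greedily lower per-layer bit depths until the model fits the budget.

    Only layers currently at the highest bit depth may be reduced, so the
    bit depths never spread by more than one (assumed uniformity constraint).
    """
    bits = [start_bits] * len(param_counts)

    def model_size():
        return sum(b * n for b, n in zip(bits, param_counts))

    while model_size() > budget_bits:
        top = max(bits)
        if top <= min_bits:
            raise ValueError("budget cannot be met at the minimum bit depth")
        candidates = [i for i, b in enumerate(bits) if b == top]
        # Reduce the candidate assumed to be least sensitive.
        least = min(candidates, key=lambda i: sensitivities[i])
        bits[least] -= 1
    return bits


# Toy example: three layers of 100 parameters each, budget of 2200 bits.
acts = [[0.1, -0.2, 0.3], [1.0, -2.0, 3.0], [0.5, 0.5, -0.5]]
sens = layer_sensitivity(acts)
print(allocate_bits(sens, [100, 100, 100], budget_bits=2200))  # [7, 8, 7]
```

In this toy run the layer with the largest median activation keeps 8 bits while the other two drop to 7, which is just enough to meet the budget. Only unlabelled activations are needed to compute the statistic, which is the property the label-free setting relies on.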

Abstract

Recent advances in Automatic Speech Recognition (ASR) have produced large AI models that are impractical to deploy on mobile devices. Model quantization is effective at producing compressed general-purpose models; however, such models may only be deployed to a restricted sub-domain of interest. We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain. To this end, we propose myQASR, a mixed-precision quantization method that generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning. myQASR automatically evaluates the quantization sensitivity of network layers by analysing the full-precision activation values. We are then able to generate a personalized mixed-precision quantization scheme for any pre-determined memory budget. Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.

Paper Structure (12 sections, 4 equations, 6 figures, 6 tables, 1 algorithm)

Introduction
Method
Experimental Analyses
Ablation Study
Conclusions
Datasets Statistics
Additional Details
Implementation Details
Hyper-parameters
Gender Personalization Results
Additional Ablation Results on W2V2-L
Qualitative Model Transcriptions

Figures (6)

Figure 1: Overview of myQASR. A large model is quantized according to users' audio data and their device storage budget.
Figure 2: Distribution of activations from the first convolution layer of Wav2Vec2 on female (F) and male (M) data.
Figure 3: WER of W2V2-B on LS-F. Original model is 360MB.
Figure S1: Number of test samples in FLEURS for each language.
Figure S2: Number of test samples per speaker of Google Speech Commands. Speaker IDs are shown as reported in the main paper. Corresponding IDs are shown in the Datasets Statistics section.
...and 1 more figure
