ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation
Yuezhang Peng, Yuxin Liu, Yao Li, Sheng Wang, Fei Wen, Xie Chen
TL;DR
This work tackles the memory bottleneck of fine-tuning large ASR foundation models by introducing ZO-ASR, a zeroth-order optimization approach that avoids back-propagation and activation storage. Using q-query gradient estimation (q-RGE) and in-place, forward-only updates, it enables SGD-based fine-tuning at inference-level memory cost, and it extends to Test-Time Adaptation (TTA) with minimal storage, keeping only seeds and gradient projections. On Whisper-Large-V3, ZO-ASR achieves up to an 18.9% relative Word Error Rate (WER) reduction over zero-shot baselines on several low-resource languages while delivering major memory savings. TTA results on Wav2Vec2-Base show a trade-off: ZO-ASR gives up some accuracy relative to the first-order Adam optimizer in exchange for BP-free operation. The paper also discusses ways to improve convergence and latency and to extend the method to edge and quantized deployments, positioning ZO-ASR as a viable option when back-propagation is infeasible.
Abstract
Fine-tuning pre-trained speech foundation models for Automatic Speech Recognition (ASR) is prevalent, yet it is constrained by substantial GPU memory requirements. We introduce ZO-ASR, a memory-efficient Zeroth-Order (ZO) method that avoids Back-Propagation (BP) and activation memory by estimating gradients from forward passes alone. When combined with the SGD optimizer, ZO-ASR-SGD fine-tunes ASR models using only inference-level memory. Our evaluation spans supervised and unsupervised tasks. For Supervised Domain Adaptation on Whisper-Large-V3, ZO-ASR's multi-query mechanism enhances robustness and achieves up to an 18.9% relative Word Error Rate reduction over zero-shot baselines, outperforming existing ZO methods. For unsupervised Test-Time Adaptation on Wav2Vec2-Base, ZO-ASR performs moderately worse than the first-order optimizer Adam. Our BP-free approach provides a viable solution for fine-tuning ASR models in resource-constrained or gradient-inaccessible scenarios.
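
To make the forward-only idea concrete, the following is a minimal NumPy sketch of one multi-query zeroth-order SGD step. It is an illustration under stated assumptions, not the authors' implementation: the function names (`direction`, `zo_sgd_step`, `apply_update`), the Gaussian perturbations, the central-difference form, and the default values of `lr` and `eps` are all hypothetical choices made for this example. Each query perturbs the parameters in place, evaluates the loss twice, and keeps only a scalar projected gradient; the random direction is regenerated from its seed rather than stored.

```python
import numpy as np


def direction(seed, shape):
    # Regenerate the random direction from its seed instead of storing it.
    return np.random.default_rng(seed).standard_normal(shape)


def zo_sgd_step(params, loss_fn, seeds, lr=0.05, eps=1e-3):
    """One forward-only SGD step with a q-query estimate (q = len(seeds)).

    `params` is a float array that is perturbed and updated in place.
    Returns the q projected gradients, one scalar per query.
    All names and defaults here are assumptions made for illustration.
    """
    projections = []
    for s in seeds:
        u = direction(s, params.shape)
        params += eps * u                 # evaluate at theta + eps * u
        loss_plus = loss_fn(params)
        params -= 2.0 * eps * u           # evaluate at theta - eps * u
        loss_minus = loss_fn(params)
        params += eps * u                 # restore theta
        projections.append((loss_plus - loss_minus) / (2.0 * eps))
    apply_update(params, seeds, projections, lr)
    return projections


def apply_update(params, seeds, projections, lr=0.05):
    # Average the q single-query estimates: theta -= lr * mean_i(g_i * u_i).
    q = len(seeds)
    for s, g in zip(seeds, projections):
        params -= (lr * g / q) * direction(s, params.shape)
```

Because `apply_update` needs only the seeds and the scalar projections, a sequence of updates can be replayed on a fresh copy of the initial parameters without saving any weights or gradients. This is one way to read the remark about storing seeds and gradient projections. On a toy quadratic loss, calling `zo_sgd_step` repeatedly with fresh seeds for each step drives the loss down using loss evaluations only.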
