Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastow, Jia Xu, Ivan Bulyko, Andreas Stolcke

TL;DR

This paper addresses memory-efficient ASR rescoring by applying LoRA to a frozen pretrained language model used as a second-pass rescorer. It systematically compares vanilla LoRA, dynamic rank allocation, high-rank warm-up, and mixed-rank training, reporting relative WER reductions of 3.50% on Librispeech and 3.67% on an internal messaging-domain dataset. It also introduces NPRR, a metric that quantifies the robustness of N-best rescoring under phonetics-based (homophone) perturbations, and shows that LoRA variants generally lag behind fully fine-tuned models in robustness. The findings highlight a trade-off between compute savings and robustness to input perturbations, which can guide practical deployment and future robustness-enhancement work.

Abstract

The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public Librispeech dataset and of 3.67% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are based on homophone replacements and are paired with a novel metric called N-best Perturbation-based Rescoring Robustness (NPRR); both are designed to measure the relative degradation in the performance of rescoring models. Our experimental results indicate that, compared with fully fine-tuned models and vanilla LoRA baselines, advanced variants of LoRA, such as dynamic rank-allocated LoRA, lead to performance degradation under $1$-best perturbation but alleviate the degradation under $N$-best perturbation. This suggests that the adaptation strategy must be chosen carefully when using LoRA-based adaptation for compute-cost savings and robust language modeling.
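
To make the idea of adapting a frozen PLM with a low-rank update concrete, the following is a minimal sketch of a single LoRA-augmented linear layer in plain Python. It follows the standard LoRA formulation (output = xWᵀ + (alpha / r) · xAᵀBᵀ, with W frozen and B initialized to zero); the class name `LoRALinear`, the helper functions, and the toy dimensions are illustrative assumptions and are not taken from the paper's implementation.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Frozen linear layer plus a trainable rank-r update (illustrative sketch)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # Trainable down-projection A, shape (rank, d_in), small random init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # Trainable up-projection B, shape (d_out, rank), zero init so the
        # adapted layer starts out identical to the pretrained layer.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        """Map x of shape (batch, d_in) to shape (batch, d_out)."""
        base = matmul(x, transpose(self.weight))
        low_rank = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [
            [b + self.scale * u for b, u in zip(base_row, update_row)]
            for base_row, update_row in zip(base, low_rank)
        ]

    def trainable_params(self):
        """Count only the LoRA parameters; the pretrained weight stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy usage: a 3 -> 2 layer adapted with rank 1.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(W, rank=1)
print(layer.forward([[1.0, 1.0, 1.0]]))  # equals the frozen layer's output at init
print(layer.trainable_params(), "trainable vs", len(W) * len(W[0]), "frozen")
```

In this sketch only `A` and `B` would be updated during training, so the number of trainable parameters per layer is r · (d_in + d_out) rather than d_in · d_out. On the toy 3 → 2 layer the low-rank factors are not smaller than the full weight; the savings only appear when the rank is much smaller than the layer dimensions, as in a real PLM.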

Paper Structure

This paper contains 19 sections, 3 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Two improved training strategies for LoRA-based ASR language modeling: (a) dynamic rank allocation and (b) mixed-rank training. In mixed-rank training, full-rank training is marked by an elephant icon, dynamic rank allocation by robots, and very low-rank fine-tuning by a mouse.
  • Figure 2: Proposed N-best evaluation for robust ASR rescoring.