Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran; Neeraj Gaur; Zhong Meng

TL;DR

This paper finds that optimizing speech prefixes leads to better ASR performance and proposes applying RNNT loss to perform speech prefix-tuning, which results in a 12% relative improvement in WER over the baseline with a fine-tuned LLM.

Abstract

In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or alter the inference pipeline. We also propose language-based soft prompting to further improve performance with frozen LLMs. Empirical analysis on a realtime test set from 10 Indic languages demonstrates that our proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Averaged over the 10 Indic languages, our recognition results show that the proposed prefix-tuning with RNNT loss results in a 12% relative improvement in WER over the baseline with a fine-tuned LLM. Our proposed approaches with the frozen LLM lead to a 31% relative improvement over basic soft-prompting prefixLM.

Paper Structure (17 sections, 3 equations, 2 figures, 5 tables)

Introduction
Methodology
PrefixLM
Prefix-tuning
RNNT decoder
Speech prefix-tuning with RNNT loss
Language ID based soft prompting
LLM training
Experiments
Datasets and Models
Results and Discussion
PrefixLM with CTC and RNNT
Language-based Prompting allows the LM to be frozen
Error analysis of speech prefix-tuning
Code-switching analysis
...and 2 more sections

Figures (2)

Figure 1: Previous works [a] and [b] denote the baseline prefixLM and soft prompting with prefixLM, respectively. Subfigures [c] and [d] are our proposed approaches, representing prefix-tuning with RNNT loss and langID-based soft prompting, respectively.
Figure 2: Training and Evaluation flow for PrefixLM with speech prefix-tuning
