Prompting Large Language Models with Speech Recognition Abilities

Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

TL;DR

The paper enables multilingual speech recognition by conditioning a decoder-only LLM on audial embeddings produced by a conformer audio encoder. By prepending a variable-length sequence of audial embeddings to the text input and using a frozen or LoRA-tuned LLM (primarily LLaMA-7B), the approach achieves competitive ASR performance on MLS without full model fine-tuning. It demonstrates substantial improvements over monolingual baselines, investigates the effect of audio encoder size and frame rate through ablations, and analyzes the alignment between audio and text embeddings. The findings suggest that LLMs can be extended to long-form, multilingual speech tasks in a parameter-efficient manner, and that explicit alignment objectives could further boost performance.

Abstract

Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system, and be used in the exact same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open-sourced LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, the effect of scaling up the audio encoder, and the effect of increasing the audio encoder striding to generate fewer embeddings. The results from these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility for LLMs to operate on long-form audio.
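
The prepending step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `build_llm_input`, the toy dimension `d_model = 16`, and the random arrays standing in for real audio and text embeddings are all assumptions made for illustration. The only requirement it encodes is that both sequences share the LLM's embedding dimension so they can be concatenated along the time axis.

```python
import numpy as np


def build_llm_input(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Prepend audio embeddings (T_audio, d) to text embeddings (T_text, d).

    Hypothetical helper: the result is one sequence of shape
    (T_audio + T_text, d) that a decoder-only LLM could consume.
    """
    if audio_emb.shape[1] != text_emb.shape[1]:
        raise ValueError("audio and text embeddings must share the LLM dimension")
    # Audio embeddings come first, followed by the text token embeddings.
    return np.concatenate([audio_emb, text_emb], axis=0)


rng = np.random.default_rng(0)
d_model = 16  # toy value chosen for the example, not the real LLM width

audio_emb = rng.normal(size=(5, d_model))  # stand-in for audio encoder output
text_emb = rng.normal(size=(3, d_model))   # stand-in for text token embeddings

llm_input = build_llm_input(audio_emb, text_emb)
print(llm_input.shape)
```

Because the audio sequence length depends on the utterance duration and the encoder stride, the combined sequence is variable-length; a coarser stride directly shortens the prefix the LLM has to attend over.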

Paper Structure (12 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Audio encoder architecture. The initial conformer is trained on a CTC loss. Thereafter the outputs are stacked and projected to the dimension of the LLM to ensure compatibility. This figure showcases a stacking factor of 3 resulting in 240ms embeddings.
  • Figure 2: Model architecture. The embedding sequence generated from the audio encoder is directly prepended to the text embeddings sequence. This is directly fed into the decoder-only LLM, tasked with predicting the next token. The LLM can be frozen, adapted with parameter-efficient approaches such as LoRA, or fully finetuned. This work investigates the first two options.
  • Figure 3: The pairwise cosine similarity between every pair of audio and text embeddings for a given test example from the English set. The subfigures (a)-(e) represent the models in Table \ref{tab:main} with stridings ranging from 80ms up to 960ms.
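
The stacking and projection step in the Figure 1 caption and the pairwise cosine similarity in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: `stack_and_project` and `pairwise_cosine` are hypothetical names, the projection is a random matrix instead of a trained layer, the dimensions are toy values, and dropping trailing frames that do not fill a complete group is an assumption (padding would be an equally plausible choice). The 80ms base frame duration follows from the captions, since a stacking factor of 3 yields 240ms embeddings.

```python
import numpy as np


def stack_and_project(frames: np.ndarray, factor: int, weight: np.ndarray) -> np.ndarray:
    """Stack `factor` consecutive encoder frames, then project to the LLM dimension.

    frames: (n_frames, d_enc); weight: (factor * d_enc, d_llm).
    """
    n_frames, d_enc = frames.shape
    # Assumption: trailing frames that do not fill a full group are dropped.
    usable = (n_frames // factor) * factor
    stacked = frames[:usable].reshape(usable // factor, factor * d_enc)
    return stacked @ weight


def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


rng = np.random.default_rng(0)
d_enc, d_llm, factor = 4, 8, 3  # toy sizes for illustration only

frames = rng.normal(size=(10, d_enc))              # 10 frames of 80ms each
weight = rng.normal(size=(factor * d_enc, d_llm))  # stand-in for a trained projection

audio_emb = stack_and_project(frames, factor, weight)  # each row now spans 240ms
text_emb = rng.normal(size=(5, d_llm))                 # stand-in for text embeddings

similarity = pairwise_cosine(audio_emb, text_emb)
print(audio_emb.shape, similarity.shape)
```

Each row of `similarity` corresponds to one audio embedding and each column to one text embedding, which is the kind of matrix the Figure 3 subfigures visualize; a larger stacking factor produces fewer rows.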