AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs

Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

TL;DR

The paper extends an instruction-tuned LLM with end-to-end speech processing without relying on curated paired data. It introduces AudioChatLlama, a decoder-only multimodal model that keeps the LLM frozen and trains a lightweight audio encoder, aligning the audio and text modalities on ASR data so that a spoken prompt elicits the same response as its text version. Experiments on MLS English show that AudioChatLlama matches or surpasses cascaded ASR+LLM baselines in perplexity and robustness, while enabling cross-modal tasks such as spoken QA, translation, and summarization. Limitations include modest audio understanding and dependence on data quality; future work targets stronger audio encoders and improved segmentation.

Abstract

In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also has extended cross-modal capabilities such as being able to perform spoken question answering (QA), speech translation, and audio summarization amongst many other closed and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. On both synthesized and recorded speech QA test sets, evaluations show that our end-to-end approach is on par with or outperforms cascaded systems (speech recognizer + LLM) in terms of modeling the response to a prompt. Furthermore, unlike cascades, our approach can interchange text and audio modalities and intrinsically utilize prior context in a conversation to provide better results.

Paper Structure (16 sections, 2 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Model architecture. The LLM consumes a sequence of embeddings irrespective of the modality and does not differentiate between them. The variable-length and continuous audio embeddings are sandwiched between some prefix and suffix which could contain instructions (and eventually a conversation history) for how the audio prompt should be interpreted. For text-based prompts, the audio encoder is swapped out for the text embedding matrix.
  • Figure 2: The aim is to create an end-to-end system that would generate the same response when being fed spoken (audio) input instead of its text version. The overall system should be invariant to the modality of the inputs containing the same semantic information.
  • Figure 3: Cosine similarity between text and audio embeddings.
  • Figure 4: An example in which the cascaded baseline had a prompt WER of 26%. AudioChatLlama generated a robust response despite the resulting ambiguities.
  • Figure 5: An example in which AudioChatLlama made a mistake even though recognition error was low. The cascaded baseline had a WER of 0%, but AudioChatLlama responded with an incorrect name that sounded similar to the correct one.
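
The captions for Figures 1 and 3 can be illustrated with a small sketch. It is not the authors' implementation: the embedding size, the sequence lengths, and the helper names (`random_embeddings`, `build_llm_input`, `similarity_matrix`) are all hypothetical, and random vectors stand in for the outputs of the text embedding matrix and the audio encoder. The sketch only shows how a prompt from either modality might be placed between a prefix and a suffix, and how a text-versus-audio cosine similarity grid of the kind shown in Figure 3 could be computed.

```python
import math
import random

EMBED_DIM = 8  # hypothetical; far smaller than a real LLM embedding size


def random_embeddings(num_vectors, dim=EMBED_DIM, seed=0):
    """Stand-in for the output of a text embedding matrix or an audio encoder."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_vectors)]


def build_llm_input(prefix, prompt, suffix):
    """Concatenate prefix, prompt, and suffix embeddings along the sequence axis.

    The prompt may come from either modality; the LLM only sees vectors.
    """
    return prefix + prompt + suffix


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(text_embeddings, audio_embeddings):
    """Rows index text embeddings, columns index audio embeddings."""
    return [
        [cosine_similarity(t, a) for a in audio_embeddings]
        for t in text_embeddings
    ]


# Hypothetical lengths: audio usually yields more embeddings than its transcript.
prefix = random_embeddings(3, seed=1)
suffix = random_embeddings(2, seed=2)
text_prompt = random_embeddings(4, seed=3)
audio_prompt = random_embeddings(10, seed=4)

text_input = build_llm_input(prefix, text_prompt, suffix)
audio_input = build_llm_input(prefix, audio_prompt, suffix)
sims = similarity_matrix(text_prompt, audio_prompt)

print(len(text_input), len(audio_input))  # sequence lengths for each modality
print(len(sims), len(sims[0]))            # grid shape: text rows x audio columns
```

With random stand-in vectors the similarity grid carries no meaning; with a trained audio encoder, one would look for audio embeddings that align with the text embeddings of the same words.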