Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

Maohao Shen; Shun Zhang; Jilong Wu; Zhiping Xiu; Ehab AlBadawy; Yiting Lu; Mike Seltzer; Qing He

TL;DR

This work introduces a text-to-speech (TTS) system powered by a fine-tuned Llama model that achieves state-of-the-art speech synthesis performance and proposes MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture.

Abstract

Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.
