Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

Maohao Shen, Shun Zhang, Jilong Wu, Zhiping Xiu, Ehab AlBadawy, Yiting Lu, Mike Seltzer, Qing He

TL;DR

This work introduces a text-to-speech (TTS) system powered by a fine-tuned Llama model that achieves state-of-the-art speech synthesis performance and proposes MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture.

Abstract

Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating catastrophic forgetting in either modality. Finally, we explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its potential as a multimodal dialog system capable of speech generation.

Paper Structure

This paper contains 25 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of TTS-Llama. The core engine of the TTS system is the fine-tuned Llama model, which extracts high-level semantic information from the input text. An acoustic model, conditioned on this semantic information, further extracts low-level acoustic features for speech synthesis.
  • Figure 2: Overview of MoLE-Llama. MoLE-Llama is trained using a late-fusion approach consisting of three stages: Stage-1: Inject the speech modality by fine-tuning a text-based Llama3-8B model for the TTS task; Stage-2: Preserve the model’s text capabilities by continuously fine-tuning the LoRA adapter using text instruct-tuning data; Stage-3: Unify the text and speech LoRA experts into a single multimodal LLM using the mixture-of-LoRA experts technique. MoLE-Llama can be extended to address additional tasks, such as speech QA, by training an extra speech QA LoRA expert during Stage-2 (see the speech QA subsection).
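
The Stage-3 idea in Figure 2, combining separately trained text and speech LoRA experts on top of one frozen base model, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `MoLoRALinear`, the softmax router, the per-token gating, and all dimensions are assumptions chosen for illustration, since the figure caption does not specify how the experts are mixed. The sketch applies the mixture to a single linear layer and a single token vector, using plain Python so it runs without dependencies.

```python
import math
import random


def matvec(x, M):
    """Multiply row vector x (length d_in) by matrix M (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, M)) for j in range(len(M[0]))]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rng, rows, cols, std=0.02):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


class MoLoRALinear:
    """Hypothetical linear layer: frozen weight plus a gated sum of LoRA experts.

    output = x W + sum_k gate_k(x) * (alpha / rank) * (x A_k) B_k
    Expert 0 could play the role of the text expert and expert 1 the speech
    expert; that assignment is an assumption, not taken from the paper.
    """

    def __init__(self, d_in, d_out, rank=4, num_experts=2, alpha=8.0, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(rng, d_in, d_out)  # frozen base weight
        # Standard LoRA init: A random, B zero, so each expert starts as a no-op.
        self.A = [rand_matrix(rng, d_in, rank) for _ in range(num_experts)]
        self.B = [[[0.0] * d_out for _ in range(rank)] for _ in range(num_experts)]
        self.router = rand_matrix(rng, d_in, num_experts)  # assumed linear router
        self.scale = alpha / rank

    def gates(self, x):
        """Mixing weights over experts for one token (non-negative, sum to 1)."""
        return softmax(matvec(x, self.router))

    def forward(self, x):
        out = matvec(x, self.W)
        for g, A, B in zip(self.gates(x), self.A, self.B):
            delta = matvec(matvec(x, A), B)  # low-rank update of expert k
            out = [o + g * self.scale * d for o, d in zip(out, delta)]
        return out
```

Because each expert's `B` matrix starts at zero, the layer initially reproduces the frozen base output, and only the adapters and router would need training. In a late-fusion setup like the one the caption describes, the experts would be trained in earlier stages and the mixing step learned last, but the training schedule is not shown here.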