LoRP-TTS: Low-Rank Personalized Text-To-Speech

Łukasz Bondaruk, Jakub Kubiak

TL;DR

LoRP addresses the challenge of creating truly diverse speech corpora by enabling high-fidelity, speaker-specific TTS from a single, noisy prompt. Low-Rank Personalization (LoRP) applies Low-Rank Adaptation ($r=16$, $\alpha=16$) to Voicebox, adding about $10^7$ parameters (≈$2.3\%$ of the weights) and requiring $100$ optimizer steps per prompt. With multilingual pretraining followed by Polish fine-tuning, LoRP improves speaker similarity by up to $30$ pp while maintaining content and naturalness, and it generalizes well across the Clarin, Fleurs, Nemo, and Kretes datasets. The approach reduces data collection costs and points toward cross-lingual and expressive-prosody extensions for diverse speech applications.
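
To make the adapter settings above concrete, here is a minimal pure-Python sketch of the generic LoRA update, $W' = W + \frac{\alpha}{r} BA$. It is not the authors' implementation: the helper names (`lora_scale`, `lora_param_count`, `lora_forward`) and the toy matrix sizes are assumptions chosen for illustration and do not reflect Voicebox's actual layer dimensions.

```python
# Generic LoRA bookkeeping on plain nested lists.
# All names and sizes here are hypothetical, for illustration only.

def lora_scale(r, alpha):
    """Scaling applied to the low-rank update: W' = W + (alpha / r) * B @ A."""
    return alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added to one (d_out x d_in) weight matrix.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, r, alpha):
    """Compute y = W x + (alpha / r) * B (A x); W itself stays frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    s = lora_scale(r, alpha)
    return [b + s * u for b, u in zip(base, update)]


# Toy example with rank 1 on a 2x2 identity weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]            # shape (r, d_in) = (1, 2)
B_zero = [[0.0], [0.0]]      # standard LoRA init: B = 0, so the update is 0
B_trained = [[1.0], [2.0]]   # hypothetical values after some optimizer steps
x = [2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, r=1, alpha=1)
y_adapted = lora_forward(W, A, B_trained, x, r=1, alpha=1)
```

With $r = \alpha = 16$ the scaling factor $\alpha / r$ equals 1, so the low-rank product is added to the frozen weight unscaled. Because $B$ starts at zero in the standard LoRA initialization, the adapted model reproduces the base model's output before the first optimizer step, and only $A$ and $B$ are updated during per-prompt personalization.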

Abstract

Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advancements have led to zero-shot systems that generate realistic speech for a wide range of speakers, using recordings of their voices as additional prompts. However, these systems still struggle to imitate non-studio-quality samples that differ significantly from the training datasets. In this work, we demonstrate that Low-Rank Adaptation (LoRA) makes it possible to use even a single recording of spontaneous speech in a noisy environment as a prompt. This approach improves speaker similarity by up to $30$ pp while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which are crucial in all speech-related tasks.

Paper Structure

This paper contains 16 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Evaluation metrics for various numbers of samples and optimizer steps on Kretes dataset.
  • Figure 2: Evaluation metrics across different datasets.
  • Figure 3: Synthesis training pipeline.