Lina-Speech: Gated Linear Attention and Initial-State Tuning for Multi-Sample Prompting Text-To-Speech Synthesis

Théodor Lemerle, Téo Guichoux, Axel Roebel, Nicolas Obin

TL;DR

Lina-Speech introduces Gated Linear Attention (GLA) to deliver linear-time, memory-efficient TTS conditioned on long-context audio while maintaining competitive quality. It further enables multi-sample conditioning via Initial-State Tuning (IST), which learns a low-rank initial state $S_0(\phi)$ to support prefix-free, cross-domain voice style and emotion cloning without changing model weights. Across zero-shot cloning, efficiency, and expressive-tuning experiments, Lina-Speech achieves strong subjective naturalness and speaker similarity while delivering significantly higher inference throughput than self-attention baselines. The work demonstrates that IST is a practical, parameter-efficient strategy for voice cloning across diverse domains, with implications for scalable, real-time TTS and expressive speech synthesis, alongside considerations for ethical use and misuse risk.

Abstract

Neural codec language models, built on the transformer architecture, have revolutionized text-to-speech (TTS) synthesis, excelling in voice cloning by treating it as a prefix continuation task. However, their limited context length restricts them to short speech samples. As a result, voice cloning covers only a limited range of the speaker's prosody and style. Moreover, adapting prosody, accent, or appropriate emotion from a short prefix remains challenging. Finally, the quadratic complexity of self-attention limits inference throughput. In this work, we introduce Lina-Speech, a TTS model that replaces standard self-attention with Gated Linear Attention (GLA) as a principled backbone, improving inference throughput while matching state-of-the-art performance. Leveraging the stateful property of recurrent architectures, we introduce an Initial-State Tuning (IST) strategy that enables conditioning on multiple speech samples of arbitrary number and length, and provides a comprehensive and efficient strategy for voice cloning and out-of-domain speaking style and emotion adaptation. We demonstrate the effectiveness of this approach for controlling fine-grained characteristics such as prosody and emotion. Code, checkpoints, and demo are freely available: https://github.com/theodorblackbird/lina-speech

Paper Structure

This paper contains 43 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Voice cloning by prompt continuation imposes a trade-off between prompt length and generation length. Neural Codec Language Models based on transformers exhibit quality degradation when generation exceeds the context length $L_{max}$, which is determined by the maximum sequence length seen during training. This creates a trade-off between prompt length and the feasible continuation length, posing a significant challenge for TTS where training samples are typically limited to under 30 seconds.
  • Figure 2: Lina-Speech model. $S^E_t$ and $S^D_t$ are the encoder and decoder states at time-step $t$, respectively. These states consist of one matrix per GLA layer and per head. For $t=0$, they default to $\mathbf{0}$ but can be tuned efficiently on a specific speaker or style. Initial-state tuning replaces the default $\mathbf{0}$ with an initial state that is learned using a soft prompt while freezing the model's parameters $\theta$.
  • Figure 3: Inference speed comparison between self-attention and gated linear attention. Inference speed was measured on an RTX 4090 for varying batch sizes. We compare Lina-Speech against an equivalent self-attention model. While self-attention is slightly faster for small batch sizes, Lina-Speech achieves much higher inference throughput.
  • Figure 4: Initial-State Tuning convergence speed. IST converges rapidly, typically within 100 steps, with an average runtime of under 20s on an RTX 4090. Example shown for speaker ex01 with the emotion "sad" from the Expresso dataset. We report training and test losses, along with the loss averaged over 16 different prompts (voice continuation) and the unconditioned loss (base model) for comparison.
  • Figure 5: Impact of the rank and learning rate for initial-state tuning on the test loss, from left to right on the Expresso, TED-LIUM, and LibriTTS datasets. For each dataset we report the best test loss averaged over 20 random speakers/styles. In particular, an initial state parameterized as a rank-one matrix performs best on TED-LIUM and LibriTTS and is close to the best rank on Expresso. Notably, the optimal learning rate does not vary across datasets.
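
The captions for Figures 2 and 5 describe a per-head state matrix that defaults to $\mathbf{0}$ and can instead be set to a learned low-rank matrix. The following is a minimal NumPy sketch of that idea for a single head, not the authors' implementation. It assumes the commonly used GLA recurrence $S_t = \mathrm{diag}(\alpha_t) S_{t-1} + k_t^\top v_t$, $o_t = q_t S_t$, which this page does not state explicitly. The names `gla_recurrent`, `low_rank_state`, `U`, and `V` are hypothetical, and the gradient-based tuning of the initial state is omitted.

```python
import numpy as np

def gla_recurrent(q, k, v, alpha, S0=None):
    """Single-head gated linear attention in recurrent form (assumed recurrence).

    q, k, alpha: arrays of shape (T, d_k); alpha holds gates in (0, 1).
    v: array of shape (T, d_v).
    S0: optional (d_k, d_v) initial state; zeros when omitted.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v)) if S0 is None else S0.copy()
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay each key channel of the state, then add the new key/value outer product.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out, S

def low_rank_state(U, V):
    """Build a rank-r initial state S0 = U V^T from U: (d_k, r) and V: (d_v, r)."""
    return U @ V.T

# Toy dimensions and random inputs, for illustration only.
rng = np.random.default_rng(0)
T, d_k, d_v, rank = 6, 4, 5, 1
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))
alpha = rng.uniform(0.8, 1.0, size=(T, d_k))
U = rng.normal(size=(d_k, rank))
V = rng.normal(size=(d_v, rank))
S0 = low_rank_state(U, V)

out_zero, _ = gla_recurrent(q, k, v, alpha)        # default zero initial state
out_tuned, _ = gla_recurrent(q, k, v, alpha, S0)   # low-rank initial state

# The recurrence is linear in the state, so S0 contributes q_t (decay_t * S0),
# where decay_t is the running product of the gates up to step t.
decay = np.cumprod(alpha, axis=0)
shift = np.einsum("tk,tk,kv->tv", q, decay, S0)
print(np.allclose(out_tuned - out_zero, shift))

# Parameter count of the initial state: low-rank versus full matrix.
print(U.size + V.size, "vs", d_k * d_v)
```

In this sketch only `U` and `V` would be trainable during tuning, which is why a rank-one state needs `d_k + d_v` values per head rather than `d_k * d_v`. Because the conditioning lives in the state rather than in a prefix, no prompt tokens occupy the sequence at generation time.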