Lina-Speech: Gated Linear Attention and Initial-State Tuning for Multi-Sample Prompting Text-To-Speech Synthesis
Théodor Lemerle, Téo Guichoux, Axel Roebel, Nicolas Obin
TL;DR
Lina-Speech uses Gated Linear Attention (GLA) to deliver linear-time, memory-efficient TTS conditioned on long-context audio while maintaining competitive quality. It also enables multi-sample conditioning via Initial-State Tuning (IST), which learns a low-rank initial state $S_0(\phi)$ to support prefix-free, cross-domain cloning of voice, style, and emotion without changing model weights. Across zero-shot cloning, efficiency, and expressive-tuning experiments, Lina-Speech achieves strong subjective naturalness and speaker similarity with significantly higher inference throughput than self-attention baselines. The work shows that IST is a practical, parameter-efficient strategy for voice cloning across diverse domains, with implications for scalable, real-time TTS and expressive speech synthesis, and it raises considerations of ethical use and misuse risk.
Abstract
Neural codec language models, built on the transformer architecture, have revolutionized text-to-speech (TTS) synthesis, excelling in voice cloning by treating it as a prefix continuation task. However, their limited context length restricts them to short speech samples. As a result, voice cloning captures only a limited range of the speaker's prosody and style. Moreover, adapting prosody, accent, or appropriate emotion from a short prefix remains challenging. Finally, the quadratic complexity of self-attention limits inference throughput. In this work, we introduce Lina-Speech, a TTS model that replaces standard self-attention with Gated Linear Attention (GLA) as a principled backbone, improving inference throughput while matching state-of-the-art performance. Leveraging the stateful property of recurrent architectures, we introduce an Initial-State Tuning (IST) strategy that enables conditioning on an arbitrary number of speech samples of arbitrary length and provides a comprehensive and efficient strategy for voice cloning and for out-of-domain speaking style and emotion adaptation. We demonstrate the effectiveness of this approach for controlling fine-grained characteristics such as prosody and emotion. Code, checkpoints, and demo are freely available: https://github.com/theodorblackbird/lina-speech
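To make the two ideas above concrete, the following is a minimal single-head NumPy sketch, not the released implementation. It assumes the commonly used gated linear attention recurrence $S_t = \mathrm{diag}(\alpha_t)\,S_{t-1} + k_t^\top v_t$, $o_t = q_t S_t$, and it assumes that the low-rank initial state is parameterized as a product $S_0 = AB$. The names `gla_recurrent`, `low_rank_initial_state`, `A`, and `B` are hypothetical and do not come from the paper or its repository.

```python
import numpy as np

def gla_recurrent(q, k, v, alpha, S0=None):
    """Single-head gated linear attention in recurrent form (illustrative sketch).

    q, k, alpha: arrays of shape (T, d_k); alpha holds per-step gates in (0, 1].
    v: array of shape (T, d_v).
    S0: optional initial state of shape (d_k, d_v); zeros when omitted.
    Returns the outputs of shape (T, d_v) and the final state.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v)) if S0 is None else S0.copy()
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay each row of the state by its gate, then write the new key/value pair.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        # Read the state with the query; cost per step does not depend on t.
        out[t] = q[t] @ S
    return out, S

def low_rank_initial_state(A, B):
    """Assumed low-rank parameterization of the tunable initial state: S0 = A @ B."""
    return A @ B

# Toy dimensions and random inputs, for illustration only.
rng = np.random.default_rng(0)
T, d_k, d_v, r = 6, 4, 5, 2
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))
alpha = 1.0 / (1.0 + np.exp(-rng.normal(size=(T, d_k))))  # sigmoid gates

# Only A and B would be trained in this sketch; the model weights stay frozen.
A = rng.normal(size=(d_k, r))
B = rng.normal(size=(r, d_v))
S0 = low_rank_initial_state(A, B)

out_plain, _ = gla_recurrent(q, k, v, alpha)       # no conditioning
out_ist, S_final = gla_recurrent(q, k, v, alpha, S0)  # conditioned through S0 only
```

In this sketch the conditioning enters only through `S0`, so no audio prefix has to be prepended at inference time, and the per-step cost and state size stay constant regardless of how many samples were used to fit `A` and `B`. A real model would apply this per head and per layer and would use a chunked parallel form for training.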
