RWKVTTS: Yet another TTS based on RWKV-7

Lin Yueyu, Liu Xiao

TL;DR

This paper investigates replacing transformer-based LLMs with the RWKV-7 RNN-based model in TTS pipelines, focusing on CosyVoice 2.0 integration to achieve efficient, expressive speech synthesis. It presents a two-stage RWKVTTS system that tokenizes audio via VQ-VAE and uses RWKV-7 to generate audio tokens from text and reference audio, with CosyVoice 2.0–specific data layouts and embeddings. The authors report that RWKVTTS delivers high-quality speech with competitive metrics compared to Ground Truth and transformer-based baselines, while offering improvements in computational efficiency and scalability, particularly for resource-constrained scenarios. They also discuss limitations in capturing production complexity and generalizability, and outline future directions such as streaming generation, explicit prosody controls, dialect handling, and advanced adaptation techniques to broaden applicability. Overall, the work demonstrates the viability of RWKV-7 as a practical, efficient alternative for TTS, potentially democratizing access to high-fidelity voice synthesis across languages and platforms.

Abstract

Human-AI interaction thrives on intuitive and efficient interfaces, among which voice stands out as a particularly natural and accessible modality. Recent advancements in transformer-based text-to-speech (TTS) systems, such as Fish-Speech, CosyVoice, and MegaTTS 3, have delivered remarkable improvements in quality and realism, driving a significant evolution in the TTS domain. In this paper, we introduce RWKV-7 \cite{peng2025rwkv}, a cutting-edge RNN-based architecture tailored for TTS applications. Unlike traditional transformer models, RWKV-7 leverages the strengths of recurrent neural networks to achieve greater computational efficiency and scalability, while maintaining high-quality output. Our comprehensive benchmarks demonstrate that RWKV-7 outperforms transformer-based models across multiple key metrics, including synthesis speed, naturalness of speech, and resource efficiency. Furthermore, we explore its adaptability to diverse linguistic contexts and low-resource environments, showcasing its potential to democratize TTS technology. These findings position RWKV-7 as a powerful and innovative alternative, paving the way for more accessible and versatile voice synthesis solutions in real-world applications. Our code and weights are available at https://github.com/yynil/RWKVTTS and https://huggingface.co/spaces/RWKV-Red-Team.

Paper Structure

This paper contains 18 sections, 3 figures.

Figures (3)

  • Figure 1: Evaluation Metrics Comparison for Ground Truth, RWKVTTS, and FireRedTTS-1S. The bar chart compares the scores of each model across four metrics: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness.
  • Figure 2: The RWKVTTS pipeline, illustrating the flow from input reference audio and prompt text to audio token generation using RWKV-7.
  • Figure 3: The forward pass of the RWKV-7 LLM in the CosyVoice 2.0 system, detailing the data layout and processing steps.
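
The first stage of the pipeline summarized above turns audio into discrete tokens with a VQ-VAE. As a rough illustration of the quantization step only, the sketch below maps each feature frame to the index of its nearest codebook vector. It is not the authors' implementation: the function name `quantize`, the toy two-dimensional codebook, and the example frames are all assumptions made for this illustration, and a real system would use a learned encoder and a much larger codebook.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each frame, the index of the nearest codebook vector.

    Nearest is measured by squared Euclidean distance. This is the generic
    vector-quantization lookup, shown here with hypothetical toy data.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical toy codebook with three 2-D entries (illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Hypothetical encoder outputs, one 2-D vector per audio frame.
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8], [0.6, 0.6]]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

In the second stage, a sequence of such token indices is what the language model (RWKV-7 in this work) would be trained to predict from the text and the reference audio tokens.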