StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion

Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang

TL;DR

StreamVoice+ tackles streaming zero-shot voice conversion without depending on a streaming ASR. It is an end-to-end framework that augments the StreamVoice backbone, now trained with a non-streaming ASR, with a semantic encoder and a residual-bottleneck (R-B) connector. Training proceeds in two stages, with LoRA adapters used in the end-to-end fine-tuning stage, and a self-refinement strategy improves speech decoupling. The system reaches a latency of about 112 ms, with naturalness and speaker similarity that are competitive with or better than non-streaming baselines. Key innovations include the R-B connector for robust semantic transmission, an $N$-layer unidirectional semantic encoder with a $k$-step delay, and a dual-mode extension that supports both streaming and non-streaming operation. The results show strong end-to-end streaming performance and data-efficient improvements, which makes the approach practical for real-time applications such as live broadcasting and online meetings while keeping it usable in both conversion modes.
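One possible reading of "unidirectional with a $k$-step delay" is that the encoder output for frame $t$ may use frames up to $t + k$, trading $k$ frames of latency for a small amount of lookahead. The summary does not spell out the mechanism, so the sketch below is an illustrative assumption rather than the authors' implementation; the function name `delayed_causal_mask` and its parameters are hypothetical.

```python
def delayed_causal_mask(num_frames: int, k: int) -> list[list[bool]]:
    """Build a visibility mask for a unidirectional encoder with lookahead.

    mask[t][j] is True when the output at frame t may use input frame j.
    Assumption (not taken from the paper): frame t sees all past frames
    plus at most k future frames, i.e. j <= t + k.
    """
    if num_frames < 0 or k < 0:
        raise ValueError("num_frames and k must be non-negative")
    return [[j <= t + k for j in range(num_frames)] for t in range(num_frames)]


if __name__ == "__main__":
    # With k = 1, each frame sees itself, the past, and one future frame.
    for row in delayed_causal_mask(4, 1):
        print("".join("x" if visible else "." for visible in row))
```

With `k = 0` the mask reduces to an ordinary causal (lower-triangular) mask; larger `k` widens the visible future at the cost of waiting for `k` more frames before emitting an output.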

Abstract

StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the streaming domain. It uses a streamable language model (LM) with a context-aware approach to convert semantic features from automatic speech recognition (ASR) into acoustic features with the desired speaker timbre. Despite its innovations, StreamVoice faces challenges due to its dependency on a streaming ASR within a cascaded framework: this dependency complicates system deployment and optimization, ties the VC system's design and performance to the choice of ASR, and undermines conversion stability when the semantic inputs are of low quality. To overcome these limitations, we introduce StreamVoice+, an enhanced LM-based end-to-end streaming framework that operates independently of streaming ASR. StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now trained using a non-streaming ASR. The model undergoes a two-stage training process: first, the StreamVoice backbone is pre-trained for voice conversion and the semantic encoder for robust semantic extraction. Subsequently, the system is fine-tuned end-to-end, incorporating a LoRA matrix to activate comprehensive streaming functionality. Furthermore, StreamVoice+ introduces two strategic enhancements to boost conversion quality: a residual compensation mechanism in the connector to ensure effective semantic transmission, and a self-refinement strategy that leverages pseudo-parallel speech pairs generated by the conversion backbone to improve speech decoupling. Experiments demonstrate that StreamVoice+ not only achieves higher naturalness and speaker similarity in voice conversion than its predecessor but also provides versatile support for both streaming and non-streaming conversion scenarios.
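The abstract mentions a LoRA matrix in the fine-tuning stage without giving details. As a reminder of the general technique only, a LoRA-adapted linear layer keeps the pre-trained weight $W$ frozen and adds a trainable low-rank update, $y = Wx + \frac{\alpha}{r} B A x$, with $B$ initialised to zero so that the adapted layer starts out identical to the pre-trained one. The minimal sketch below uses plain Python lists; the class name `LoRALinear`, the rank, the scaling factor, and the initialisation scale are illustrative assumptions, and nothing here says which StreamVoice+ layers carry adapters.

```python
import random


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a low-rank update (generic LoRA sketch).

    Assumption: hyperparameters are illustrative, not taken from the paper.
    """

    def __init__(self, weight: list[list[float]], rank: int,
                 alpha: float = 1.0, seed: int = 0) -> None:
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight            # pre-trained weight, kept frozen
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A: small random values; B: zeros, so the initial update is zero.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x: list[float]) -> list[float]:
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
    # Before any fine-tuning the output equals the frozen layer's output.
    print(layer.forward([1.0, 1.0]))
```

During fine-tuning only `a` and `b` would be updated, which is what lets a pre-trained backbone be adapted to a new mode of operation with few trainable parameters.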

Paper Structure (22 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: The framework of (a) StreamVoice+, which employs a two-stage training procedure, (b) pre-training and (c) fine-tuning, to achieve end-to-end conversion.