StyleStream: Real-Time Zero-Shot Voice Style Conversion

Yisi Liu; Nicholas Lee; Gopala Anumanchipalli

TL;DR

StyleStream is the first streamable zero-shot voice style conversion system. It achieves state-of-the-art performance with a fully non-autoregressive architecture, enabling real-time voice style conversion at an end-to-end latency of 1 second.

Abstract

Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.

Paper Structure (30 sections, 6 equations, 3 figures, 9 tables)

Introduction
Related Work
Voice Style Cloning
Speech Content Disentanglement
Real-Time Voice Conversion
Method
Destylizer: Content-Style Disentanglement
Stylizer: Stylized Acoustic Modeling
Vocoder
Real-Time Design
Experimental Setup
Dataset
Training
Baselines
Metrics
...and 15 more sections

Figures (3)

Figure 1: System overview of StyleStream. The Destylizer extracts content features disentangled from style, and the Stylizer generates speech that preserves the source linguistic content while adopting the target timbre, accent, and emotion.
Figure 2: Destylizer architecture. The Destylizer, as part of the ASR encoder, is trained with a sequence-to-sequence ASR loss. The continuous representations immediately before the FSQ module are taken as content features.
Figure 3: Stylizer architecture. The Stylizer contains a style encoder and a diffusion transformer.
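
The Figure 2 caption refers to an FSQ module acting as the bottleneck in the Destylizer. As a rough illustration only, assuming FSQ here means finite scalar quantization, the sketch below bounds each channel with tanh and rounds it to a small set of integer levels. The function names, the level counts, and the restriction to odd level counts are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Round each channel of z to one of a fixed number of integer levels.

    z:      array of shape (..., d), unbounded continuous features.
    levels: sequence of d odd integers (hypothetical values, not from the paper).
    """
    levels = np.asarray(levels)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch supports odd level counts only")
    half = (levels - 1) / 2.0
    # tanh squashes each channel into (-1, 1); scaling maps it to (-half, half).
    bounded = np.tanh(z) * half
    # Rounding yields integer codes in {-half, ..., +half} per channel.
    # Training would typically need a straight-through gradient here;
    # that part is omitted from this sketch.
    return np.round(bounded).astype(int)


def codebook_size(levels):
    """Number of distinct codes: the product of the per-channel level counts."""
    return int(np.prod(levels))
```

With level counts of (3, 5, 5), for instance, the implied codebook holds 3 × 5 × 5 = 75 entries, which shows how a few coarse channels keep the bottleneck tight. Per the caption, the content features are the continuous representations taken before this step, not the rounded codes.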
