StyleStream: Real-Time Zero-Shot Voice Style Conversion

Yisi Liu, Nicholas Lee, Gopala Anumanchipalli

TL;DR

StyleStream is the first streamable zero-shot voice style conversion system. It achieves state-of-the-art performance with a fully non-autoregressive architecture, enabling real-time voice style conversion with an end-to-end latency of 1 second.

Abstract

Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.

Paper Structure

This paper contains 30 sections, 6 equations, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: System overview of StyleStream. The Destylizer extracts content features disentangled from style, and the Stylizer generates speech that preserves the source linguistic content while adopting the target timbre, accent, and emotion.
  • Figure 2: Destylizer architecture. The Destylizer, as part of the ASR encoder, is trained with a sequence-to-sequence ASR loss. The continuous representations immediately before the FSQ module are taken as content features.
  • Figure 3: Stylizer architecture. The Stylizer contains a style encoder and a diffusion transformer.
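
The Figure 2 caption mentions an FSQ module that sits right after the content features. Assuming FSQ here refers to finite scalar quantization (the caption does not spell it out), the bottleneck idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fsq_bottleneck`, the channel count, and the level counts are all hypothetical, and only odd level counts are handled to keep the rounding simple.

```python
import numpy as np

def fsq_bottleneck(pre_fsq, levels):
    """Illustrative finite scalar quantization (assumed meaning of "FSQ").

    pre_fsq: array of shape (frames, channels), the continuous features
             that enter the quantizer.
    levels:  odd number of quantization levels per channel (hypothetical values).
    Returns the quantized features, scaled to lie in [-1, 1].
    """
    levels = np.asarray(levels, dtype=np.float64)
    half = (levels - 1) / 2.0            # e.g. 5 levels -> integers in {-2, ..., 2}
    bounded = np.tanh(pre_fsq) * half    # squash each channel into (-half, half)
    return np.round(bounded) / half      # snap to the integer grid, then rescale

# Hypothetical setup: 3 channels with 3, 5, and 7 levels each.
levels = [3, 5, 7]
rng = np.random.default_rng(0)
pre_fsq = rng.normal(size=(1000, 3))     # stand-in for continuous content features
quantized = fsq_bottleneck(pre_fsq, levels)

# Each frame can take at most 3 * 5 * 7 = 105 distinct codes.
codebook_size = int(np.prod(levels))
```

With so few codes available per frame, the quantized path can carry only a limited amount of information, which is one way to read the "highly constrained information bottleneck" in the abstract. Per the caption, the content features themselves are the continuous values before quantization (`pre_fsq` in this sketch), not the quantized output.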