FlowW2N: Whispered-to-Normal Speech Conversion via Flow-Matching

Fabian Ritter-Gutierrez, Md Asif Jalal, Pablo Peso Parada, Karthikeyan Saravanan, Yusun Shul, Minseung Kim, Gun-Woo Lee, Han-Gil Moon

TL;DR

FlowW2N is a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. It achieves SOTA intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work.

Abstract

Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whispered and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion optimizing content informativeness and cross-domain invariance. Our method achieves SOTA intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work while using only 10 steps at inference and requiring no real paired data.
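To make the "10 steps at inference" concrete, the following is a minimal sketch of fixed-step ODE sampling for a flow matching model, integrating from Gaussian noise at t = 0 to a latent at t = 1. It assumes a plain Euler solver, which the abstract does not specify; `euler_sample` and `toy_velocity` are hypothetical names, and `toy_velocity` is only a stand-in for a trained velocity network, not the authors' model.

```python
import random


def euler_sample(velocity_fn, cond, dim, num_steps=10, seed=0):
    """Integrate dz/dt = v(z, t, cond) from t=0 to t=1 with fixed-step Euler.

    Assumption: a plain Euler solver; the source only states the step count.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # z_0 drawn from N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t takes values 0, dt, ..., 1 - dt
        v = velocity_fn(z, t, cond)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z  # approximation of z_1


def toy_velocity(z, t, cond):
    """Hypothetical stand-in for a trained network.

    Points straight at `cond`, which is the velocity of a straight-line
    path that reaches `cond` at t=1. A real model would predict this
    from conditioning features instead of being given the target.
    """
    return [(ci - zi) / (1.0 - t) for zi, ci in zip(z, cond)]


target = [0.5, -1.0, 2.0]  # illustrative values only
z1 = euler_sample(toy_velocity, target, dim=len(target))
```

With this toy field the straight-line path is followed exactly, so ten Euler steps land on `target` up to floating-point error. A learned velocity field is only approximately straight, so the step count then trades speed against accuracy.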

Paper Structure (18 sections, 5 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: FlowW2N pipeline. Left (Training): The DiT learns a velocity field $\mathbf{v}_\theta(\mathbf{z}_t, t, \mathbf{c})$ conditioned on domain-invariant content features $\mathbf{h}$ from a content encoder (layer $\ell^*$) and speaker embedding $\mathbf{e}_{\text{spk}}$, where $\mathbf{c} = \{\mathbf{e}_{\text{spk}}, \mathbf{h}\}$ represents the conditioning set of content and speaker. Training uses only synthetic whisper-normal pairs. Right (Inference): Starting from Gaussian noise, the ODE is integrated to obtain $\mathbf{z}_1$, which is decoded to normal speech. Domain invariance of content features enables generalization to real whispered speech.
  • Figure 2: Domain invariance analysis. Synthesis Gap (Syn): synthetic vs. real whisper; Modality Gap (Mod): real whisper vs. normal speech.
  • Figure 3: Layer selection analysis. Left: Synthesis Gap (invariance). Center: CCA with word identity. Right: Proposed combined score (invariance $\times$ semantic). Stars mark the selected optimal layers.
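
The combined score in Figure 3 (invariance $\times$ semantic) can be illustrated with a short sketch. The exact formula is not given in the text above, so this version makes two assumptions: the Synthesis Gap is turned into an invariance score by min-max normalization and inversion, and the semantic term is a per-layer CCA value used as-is. The function name `select_layer` and all numbers are hypothetical and are not results from the paper.

```python
def select_layer(synthesis_gap, semantic_score):
    """Pick the layer maximizing invariance * semantic score.

    synthesis_gap: per-layer gap between synthetic and real whisper
        features (lower means more invariant).
    semantic_score: per-layer content score, e.g. CCA with word identity
        (higher means more informative).
    Assumption: invariance = 1 - min-max-normalized gap.
    """
    if len(synthesis_gap) != len(semantic_score):
        raise ValueError("both inputs need one value per layer")
    lo, hi = min(synthesis_gap), max(synthesis_gap)
    span = (hi - lo) or 1.0  # avoid division by zero if all gaps are equal
    invariance = [1.0 - (g - lo) / span for g in synthesis_gap]
    combined = [i * s for i, s in zip(invariance, semantic_score)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


# Illustrative per-layer values only, not measurements.
gaps = [0.9, 0.5, 0.2, 0.4]
cca = [0.2, 0.5, 0.7, 0.9]
best_layer, scores = select_layer(gaps, cca)
```

The product favors layers that are strong on both criteria. In the toy values, the last layer has the highest semantic score but loses to the layer at index 2, whose smaller gap outweighs its slightly lower CCA value.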