VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration

Stanislav Kirdey

TL;DR

VoiceRestore restores speech recordings degraded by noise, reverberation, compression artifacts, and bandwidth loss, for both short and long recordings. It introduces a self-supervised, conditional flow-matching framework built on Transformers that learns to map a degraded input $y$ to clean speech $x$ through a continuous-time flow with intermediate states $x_t$. A neural vector field $v_\theta(x_t, t, y)$ is trained by minimizing $L(\theta)=\mathbb{E}_{t,x,y}[\|u_t(x_t)-v_\theta(x_t,t,y)\|_2^2]$, where $x_t=(1-\alpha(t))y+\alpha(t)x$ and $u_t(x_t)=\dot{\alpha}(t)(x-y)$. The model is trained on synthetic degradations and handles variable-length sequences by chunking the input into overlapping windows; the Transformer uses 768-dimensional embeddings, 20 layers, and 16 attention heads. Empirical results on English speech restoration and the VoiceBank-DEMAND benchmark show improvements over baselines in both perceptual and objective metrics, with practical benefits for applications ranging from telecommunications to archival preservation, and improved ASR performance when the model is used as a preprocessing step. Overall, VoiceRestore handles multiple degradation types and recording lengths in a single self-supervised framework with a scalable Transformer-based architecture.
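
The training objective above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear schedule $\alpha(t)=t$, the uniform sampling of $t$, the array shapes, and the names `alpha`, `alpha_dot`, and `flow_matching_loss` are all assumptions made for illustration, and `v_theta` stands in for the Transformer.

```python
import numpy as np

def alpha(t):
    """Assumed linear schedule alpha(t) = t, so alpha(0) = 0 and alpha(1) = 1."""
    return t

def alpha_dot(t):
    """Time derivative of the assumed linear schedule."""
    return np.ones_like(t)

def flow_matching_loss(v_theta, x, y, rng):
    """Monte Carlo estimate of E[||u_t(x_t) - v_theta(x_t, t, y)||_2^2] on one batch.

    x, y    : arrays of shape (batch, frames, features), clean and degraded speech.
    v_theta : callable (x_t, t, y) -> array shaped like x (stand-in for the model).
    rng     : numpy.random.Generator used to sample t (uniform on [0, 1], assumed).
    """
    batch = x.shape[0]
    t = rng.uniform(0.0, 1.0, size=(batch, 1, 1))   # one time value per example
    x_t = (1.0 - alpha(t)) * y + alpha(t) * x        # intermediate state
    u_t = alpha_dot(t) * (x - y)                     # target vector field
    residual = u_t - v_theta(x_t, t, y)
    return np.mean(np.sum(residual ** 2, axis=(1, 2)))
```

Under the assumed linear schedule the target field reduces to $x-y$ for every $t$, so a field that always outputs $x-y$ drives this estimate to zero.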

Abstract

We present VoiceRestore, a novel approach to restoring the quality of speech recordings using flow-matching Transformers trained in a self-supervised manner on synthetic data. Our method tackles a wide range of degradations frequently found in both short and long-form speech recordings, including background noise, reverberation, compression artifacts, and bandwidth limitations, all within a single, unified model. Leveraging conditional flow matching and classifier-free guidance, the model learns to map degraded speech to high-quality recordings without requiring paired clean and degraded datasets. We describe the training process, the conditional flow matching framework, and the model's architecture. We also demonstrate the model's generalization to real-world speech restoration tasks, including both short utterances and extended monologues or dialogues. Qualitative and quantitative evaluations show that our approach provides a flexible and effective solution for enhancing the quality of speech recordings across varying lengths and degradation types.
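
As a rough illustration of how conditional flow matching and classifier-free guidance can combine at inference time, the sketch below integrates a guided vector field with fixed-step Euler updates. Everything specific here is an assumption rather than the paper's procedure: starting the flow at the degraded input, using an all-zeros tensor as the "null" condition, the guidance convention `v_uncond + w * (v_cond - v_uncond)`, the Euler solver, and the hypothetical names `guided_velocity` and `restore`.

```python
import numpy as np

def guided_velocity(v_theta, x_t, t, y, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional field
    toward the conditional one. guidance_scale = 1 recovers the conditional field."""
    v_cond = v_theta(x_t, t, y)
    # Assumption: the unconditional branch sees an all-zeros condition.
    v_uncond = v_theta(x_t, t, np.zeros_like(y))
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def restore(v_theta, y, num_steps=32, guidance_scale=1.0):
    """Integrate dx/dt = v(x_t, t, y) from t = 0 to t = 1 with Euler steps.

    Assumption: the flow starts at the degraded input, x_0 = y.
    """
    x_t = y.copy()
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x_t = x_t + dt * guided_velocity(v_theta, x_t, t, y, guidance_scale)
    return x_t
```

With a trained model, `v_theta` would be the Transformer's forward pass; larger guidance scales weight the conditioning signal more heavily, at the cost of two model evaluations per step.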

Paper Structure

This paper contains 21 sections, 3 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Conditional Flow Matching Process for Speech Restoration
  • Figure 2: Synthetic Degradation Pipeline for Self-Supervised Training
  • Figure 3: Transformer-based Architecture for Conditional Flow Matching
  • Figure 4: Heavy Distortion and Gain
  • Figure 5: Heavy Reverberation