VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration

Stanislav Kirdey

TL;DR

VoiceRestore restores speech recordings degraded by noise, reverberation, compression artifacts, and bandwidth loss, for both short and long recordings. It introduces a self-supervised conditional flow-matching framework built on Transformers. The model learns to map degraded input $y$ to clean speech $x$ through a continuous-time flow with intermediate states $x_t$ and a neural vector field $v_\theta(x_t, t, y)$, trained by minimizing $L(\theta)=\mathbb{E}_{t,x,y}\left[\|u_t(x_t)-v_\theta(x_t,t,y)\|_2^2\right]$, where $x_t=(1-\alpha(t))y+\alpha(t)x$ and $u_t(x_t)=\dot{\alpha}(t)(x-y)$. Training uses synthetic degradations in a self-supervised fashion, and variable-length sequences are handled by chunking with overlapping windows. The Transformer uses 768-dimensional embeddings, 20 layers, and 16 attention heads. Empirical results on English speech restoration and the VoiceBank-DEMAND benchmark show improvements over baselines in both perceptual and objective metrics. These results suggest practical benefits for applications ranging from telecommunications to archival preservation, as well as improved ASR performance when the model is used as a preprocessing step. Overall, VoiceRestore handles multiple degradation types and recording lengths within a single self-supervised framework built on a scalable Transformer architecture.
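
The objective above can be illustrated with a small numerical sketch. The following is not the paper's training code: it assumes a linear schedule $\alpha(t)=t$ (the summary does not state the schedule), uses NumPy arrays in place of real speech features, and the names `linear_alpha`, `flow_matching_loss`, `perfect_field`, and `zero_field` are hypothetical. It only shows how $x_t$, $u_t$, and the squared-error loss fit together.

```python
import numpy as np

def linear_alpha(t):
    """Assumed schedule alpha(t) = t (not specified in the summary)."""
    return t

def linear_alpha_dot(t):
    """Time derivative of the assumed linear schedule, equal to 1."""
    return np.ones_like(t)

def flow_matching_loss(v_theta, x, y, t):
    """Monte Carlo estimate of E[ ||u_t(x_t) - v_theta(x_t, t, y)||_2^2 ].

    x, y:    arrays of shape (batch, frames, features), clean and degraded.
    t:       array of shape (batch,), times in [0, 1].
    v_theta: callable (x_t, t, y) -> array with the same shape as x.
    """
    a = linear_alpha(t)[:, None, None]          # broadcast over frames, features
    a_dot = linear_alpha_dot(t)[:, None, None]
    x_t = (1.0 - a) * y + a * x                 # intermediate state x_t
    u_t = a_dot * (x - y)                       # target velocity u_t(x_t)
    residual = u_t - v_theta(x_t, t, y)
    # Squared L2 norm per example, then average over the batch.
    return np.mean(np.sum(residual ** 2, axis=(1, 2)))

# Toy data: "clean" features and a noisy copy standing in for degraded speech.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10, 8))
y = x + 0.3 * rng.standard_normal((4, 10, 8))
t = rng.uniform(0.0, 1.0, size=4)

# Two stand-in vector fields: one that cheats by knowing x, one that predicts zero.
def perfect_field(x_t, t, y):
    return x - y

def zero_field(x_t, t, y):
    return np.zeros_like(x_t)

loss_perfect = flow_matching_loss(perfect_field, x, y, t)
loss_zero = flow_matching_loss(zero_field, x, y, t)
print(loss_perfect, loss_zero)
```

Under the assumed linear schedule the target velocity is simply $x-y$ at every $t$, so a field that outputs $x-y$ drives the loss to zero, while the zero field leaves the full squared distance between clean and degraded features.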

Abstract

We present VoiceRestore, a novel approach to restoring the quality of speech recordings using flow-matching Transformers trained in a self-supervised manner on synthetic data. Our method tackles a wide range of degradations frequently found in both short and long-form speech recordings, including background noise, reverberation, compression artifacts, and bandwidth limitations, all within a single, unified model. Leveraging conditional flow matching and classifier-free guidance, the model learns to map degraded speech to high-quality recordings without requiring paired clean and degraded datasets. We describe the training process, the conditional flow matching framework, and the model's architecture. We also demonstrate the model's generalization to real-world speech restoration tasks, including both short utterances and extended monologues or dialogues. Qualitative and quantitative evaluations show that our approach provides a flexible and effective solution for enhancing the quality of speech recordings across varying lengths and degradation types.
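
As a rough illustration of how classifier-free guidance can be combined with a learned vector field at inference time, the sketch below integrates a guided velocity from the degraded input toward a restored output with fixed-step Euler updates. Everything here is an assumption for illustration: the abstract does not give the guidance formula, the solver, the step count, or how the unconditional branch is represented. The guidance rule shown is the commonly used form `v_uncond + w * (v_cond - v_uncond)`, the all-zeros `null_cond` is a placeholder, and `guided_velocity`, `euler_restore`, and `toy_field` are hypothetical names.

```python
import numpy as np

def guided_velocity(v_theta, x_t, t, y, guidance_scale, null_cond):
    """Blend conditional and unconditional predictions (assumed CFG form)."""
    v_cond = v_theta(x_t, t, y)            # conditioned on the degraded input
    v_uncond = v_theta(x_t, t, null_cond)  # conditioning replaced by a placeholder
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def euler_restore(v_theta, y, num_steps=32, guidance_scale=1.0):
    """Integrate dx/dt = v(x_t, t, y) from t = 0 (x_0 = y) to t = 1 with Euler steps."""
    null_cond = np.zeros_like(y)  # assumed "no conditioning" signal
    x_t = y.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x_t = x_t + dt * guided_velocity(v_theta, x_t, t, y, guidance_scale, null_cond)
    return x_t

# Toy check with a stand-in field that cheats by knowing the clean signal.
rng = np.random.default_rng(1)
clean = rng.standard_normal((10, 8))
degraded = clean + 0.5 * rng.standard_normal((10, 8))

def toy_field(x_t, t, cond):
    # Constant velocity pointing from the conditioning signal to the clean signal.
    return clean - cond

restored = euler_restore(toy_field, degraded, num_steps=32, guidance_scale=1.0)
print(np.max(np.abs(restored - clean)))
```

With `guidance_scale=1.0` the guided velocity reduces to the conditional prediction; larger values push the update further along the direction in which conditioning changes the prediction. A trained model would replace `toy_field`.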

Paper Structure (21 sections, 3 equations, 5 figures, 2 tables)

Introduction
Related Work
Proposed Method
Problem Formulation
Conditional Flow Matching for Speech Restoration
Mathematical Framework
Self-Supervised Training with Synthetic Degradations
Handling Variable-Length Recordings
Network Architecture
Detailed Architecture Specifications
Training Procedure
Optimizer and Learning Rate
Gradient Accumulation and Mixed Precision
Data Loading and Preprocessing
Degradation Generation
...and 6 more sections

Figures (5)

Figure 1: Conditional Flow Matching Process for Speech Restoration
Figure 2: Synthetic Degradation Pipeline for Self-Supervised Training
Figure 3: Transformer-based Architecture for Conditional Flow Matching
Figure 4: Heavy Distortion and Gain
Figure 5: Heavy Reverberation
