Multi-Metric Preference Alignment for Generative Speech Restoration

Junan Zhang; Xueyao Zhang; Jing Yang; Yuancheng Wang; Fan Fan; Zhizheng Wu

TL;DR

The paper tackles the misalignment between likelihood-based training objectives and human perceptual preferences in generative speech restoration (GenSR). It proposes a multi-metric preference alignment strategy and a dataset, GenSR-Pref, of roughly 80K preference pairs in which the chosen sample is unanimously favored by metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. Models are then optimized with Direct Preference Optimization (DPO) across three paradigms: autoregressive (AR), masked generative (MGM), and flow-matching (FM). In both objective and subjective evaluations, the approach yields consistent, data-efficient improvements for all three paradigms, and ablations show that the multi-metric signal mitigates reward hacking better than single-metric signals. The paper also shows that aligned generative models can act as high-quality pseudo-labelers for training discriminative models in data-scarce settings, notably singing voice restoration.

Abstract

Recent generative models have significantly advanced speech restoration tasks, yet their training objectives often misalign with human perceptual preferences, resulting in suboptimal quality. While post-training alignment has proven effective in other generative domains like text and image generation, its application to generative speech restoration remains largely under-explored. This work investigates the challenges of applying preference-based post-training to this task, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. To address these challenges, we propose a multi-metric preference alignment strategy. We construct a new dataset, GenSR-Pref, comprising 80K preference pairs, where each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. This principled approach ensures a holistic preference signal. Applying Direct Preference Optimization (DPO) with our dataset, we observe consistent and significant performance gains across three diverse generative paradigms: autoregressive models (AR), masked generative models (MGM), and flow-matching models (FM) on various restoration benchmarks, in both objective and subjective evaluations. Ablation studies confirm the superiority of our multi-metric strategy over single-metric approaches in mitigating reward hacking. Furthermore, we demonstrate that our aligned models can serve as powerful "data annotators", generating high-quality pseudo-labels to serve as a supervision signal for traditional discriminative models in data-scarce scenarios like singing voice restoration. Demo Page: https://gensr-pref.github.io
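
To make the "unanimously favored" criterion concrete, the following is a minimal sketch of how such a pair filter could look. It is not the authors' implementation: the metric names (`quality`, `fidelity`, `wer`, `speaker_sim`), their directions, and the strict-inequality rule for ties are all assumptions chosen for illustration. A pair is kept only when one candidate beats the other on every metric.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical metric suite (assumption, not taken from the paper).
# True means "higher is better"; False means "lower is better".
METRICS: Dict[str, bool] = {
    "quality": True,      # perceptual quality score
    "fidelity": True,     # signal fidelity score
    "wer": False,         # content consistency, as a word error rate
    "speaker_sim": True,  # timbre preservation, as speaker similarity
}


def unanimously_better(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """Return True only if `a` strictly beats `b` on every metric."""
    for name, higher_is_better in METRICS.items():
        diff = a[name] - b[name]
        if not higher_is_better:
            diff = -diff
        if diff <= 0:  # a tie or a loss on any metric breaks unanimity
            return False
    return True


def build_pairs(candidates: List[Dict[str, float]]) -> List[Tuple[int, int]]:
    """Return (chosen_index, rejected_index) pairs with unanimous agreement."""
    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        if unanimously_better(candidates[i], candidates[j]):
            pairs.append((i, j))
        elif unanimously_better(candidates[j], candidates[i]):
            pairs.append((j, i))
        # Otherwise the metrics disagree and the pair is discarded.
    return pairs


# Scores for three restored outputs of the same degraded input (made-up values).
candidates = [
    {"quality": 4.0, "fidelity": 0.9, "wer": 0.05, "speaker_sim": 0.8},
    {"quality": 3.0, "fidelity": 0.7, "wer": 0.10, "speaker_sim": 0.6},
    {"quality": 4.5, "fidelity": 0.6, "wer": 0.02, "speaker_sim": 0.9},
]
pairs = build_pairs(candidates)
```

In this toy example only candidate 0 versus candidate 1 survives. Candidate 2 has the best quality, word error rate, and speaker similarity but the worst fidelity, so every pair involving it is dropped; this is the kind of single-metric win that a unanimity filter is meant to exclude.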
