Multi-Metric Preference Alignment for Generative Speech Restoration

Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu

TL;DR

The paper tackles the misalignment between likelihood-based training and human perceptual preferences in generative speech restoration (GenSR). It proposes a multi-metric preference alignment strategy and a dataset, GenSR-Pref, of roughly 80K preference pairs in which the chosen sample is unanimously favored by metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. Models are then fine-tuned with Direct Preference Optimization (DPO) across three paradigms: autoregressive (AR), masked generative (MGM), and flow-matching (FM). In both objective and subjective evaluations, the approach yields consistent, data-efficient improvements for all three paradigms, and ablations show that the multi-metric signal mitigates reward hacking better than single-metric alternatives. The paper also shows that aligned generative models can serve as pseudo-labelers for training discriminative models in data-scarce settings such as singing voice restoration.

Abstract

Recent generative models have significantly advanced speech restoration tasks, yet their training objectives often misalign with human perceptual preferences, resulting in suboptimal quality. While post-training alignment has proven effective in other generative domains like text and image generation, its application to generative speech restoration remains largely under-explored. This work investigates the challenges of applying preference-based post-training to this task, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. To address these challenges, we propose a multi-metric preference alignment strategy. We construct a new dataset, GenSR-Pref, comprising 80K preference pairs, where each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. This principled approach ensures a holistic preference signal. Applying Direct Preference Optimization (DPO) with our dataset, we observe consistent and significant performance gains across three diverse generative paradigms: autoregressive models (AR), masked generative models (MGM), and flow-matching models (FM) on various restoration benchmarks, in both objective and subjective evaluations. Ablation studies confirm the superiority of our multi-metric strategy over single-metric approaches in mitigating reward hacking. Furthermore, we demonstrate that our aligned models can serve as powerful "data annotators", generating high-quality pseudo-labels to serve as a supervision signal for traditional discriminative models in data-scarce scenarios like singing voice restoration. Demo Page: https://gensr-pref.github.io
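
The unanimous-agreement rule described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the metric names, their directions, the strict-inequality rule, and the example scores below are all assumptions made for illustration. The idea shown is only that a pair is kept when one candidate is better on every metric, and discarded when the metrics disagree.

```python
from itertools import combinations

# Hypothetical metric names (assumptions, not taken from the paper).
# True means higher is better; False means lower is better.
METRICS = {
    "quality": True,      # stand-in for a perceptual quality score
    "fidelity": True,     # stand-in for a signal fidelity score
    "wer": False,         # stand-in for a content error rate
    "speaker_sim": True,  # stand-in for a timbre similarity score
}


def unanimously_better(a, b, metrics=METRICS):
    """Return True if candidate `a` is strictly better than `b` on every metric."""
    for name, higher_is_better in metrics.items():
        if higher_is_better and not a[name] > b[name]:
            return False
        if not higher_is_better and not a[name] < b[name]:
            return False
    return True


def build_pairs(candidates, metrics=METRICS):
    """Return (chosen_index, rejected_index) pairs with unanimous agreement."""
    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        if unanimously_better(candidates[i], candidates[j], metrics):
            pairs.append((i, j))
        elif unanimously_better(candidates[j], candidates[i], metrics):
            pairs.append((j, i))
        # Otherwise the metrics disagree and the pair is dropped.
    return pairs


# Made-up scores for three restorations of the same degraded input.
candidates = [
    {"quality": 3.1, "fidelity": 0.80, "wer": 0.10, "speaker_sim": 0.90},
    {"quality": 3.5, "fidelity": 0.85, "wer": 0.05, "speaker_sim": 0.93},
    {"quality": 4.0, "fidelity": 0.70, "wer": 0.08, "speaker_sim": 0.91},
]
pairs = build_pairs(candidates)
print(pairs)  # [(1, 0)]
```

In this toy example the third candidate has the highest quality score but the lowest fidelity, so it never forms a pair. A single-metric rule based on quality alone would have selected it as a winner, which is the kind of one-sided signal the multi-metric strategy is meant to filter out.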

Paper Structure

This paper contains 34 sections, 23 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: An overview of our multi-metric preference alignment strategy. The process consists of two main stages: (1) constructing the GenSR-Pref preference dataset by ranking model outputs based on unanimous agreement across multiple metrics, and (2) fine-tuning the model with these preferences using Direct Preference Optimization (DPO).
  • Figure 2: Human AB preference test results for DPO-aligned models on the Librivox-GSR test set.
  • Figure 3: Ablation study on training objectives for the MGM model. DPO demonstrates consistent improvements across training steps, while SFT tends to stagnate or degrade. Notably, a naive DPO variant that treats ground-truth outputs as winners ("GT Winner") results in model collapse.
  • Figure 4: Detailed training curves comparing standard DPO and DPO (GT Winner) training (smoothing factor = 0.99). Using ground truth as the unconditional winner leads to inflated reward margins and saturated reward accuracy, indicating model collapse.
  • Figure 5: AB Test interface for subjective evaluation.
  • ...and 5 more figures
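
The captions for Figures 1 and 4 refer to DPO fine-tuning, reward margins, and reward accuracy. As a reminder of how these quantities relate, here is a minimal sketch of the standard DPO objective for a single preference pair, assuming sequence log-likelihoods are available from both the policy and a frozen reference model. The function names, the `beta` value, and the numbers are illustrative assumptions; how the paper adapts the objective to masked generative and flow-matching models is not covered by this section and is not reproduced here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss and reward margin for one preference pair.

    Inputs are log-likelihoods of the chosen and rejected outputs under the
    policy and under a frozen reference model. `beta` is an assumed value.
    """
    # Implicit rewards: scaled log-ratio between policy and reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin


def reward_accuracy(margins):
    """Fraction of pairs whose chosen output gets the higher implicit reward."""
    return sum(m > 0 for m in margins) / len(margins)


# Illustrative numbers only: the policy slightly prefers the chosen output.
loss, margin = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

When the policy equals the reference model, the margin is zero and the loss is log 2. Training lowers the loss by increasing the margin, so a margin that grows without bound together with reward accuracy pinned near 1.0 means the pairs have become trivially separable, which is consistent with the collapse the Figure 4 caption describes for the GT Winner variant.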