Using RLHF to align speech enhancement approaches to mean-opinion quality scores

Anurag Kumar, Andrew Perrault, Donald S. Williamson

TL;DR

This work proposes a reinforcement learning from human feedback (RLHF) framework that fine-tunes an existing speech enhancement approach using a mean-opinion score (MOS)-based reward model. The RLHF-fine-tuned model achieves the best performance on both objective and MOS-based speech quality metrics on the Voicebank+DEMAND dataset.

Abstract

Objective speech quality measures are typically used to assess speech enhancement algorithms, but it has been shown that they are sub-optimal as learning objectives because they do not always align well with human subjective ratings. This misalignment often results in noticeable distortions and artifacts that cause speech enhancement to be ineffective. To address these issues, we propose a reinforcement learning from human feedback (RLHF) framework to fine-tune an existing speech enhancement approach by optimizing performance using a mean-opinion score (MOS)-based reward model. Our results show that the RLHF-finetuned model has the best performance across different benchmarks for both objective and MOS-based speech quality assessment metrics on the Voicebank+DEMAND dataset. Through ablation studies, we show that both policy gradient loss and supervised MSE loss are important for balanced optimization across the different metrics.

Paper Structure

This paper contains 11 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The proposed training framework. A reward is used to calculate the PPO clip loss, which is then combined with an MSE loss to update the SE model.
  • Figure 2: NISQA MOS on the VCTK test set, recorded every 10 episodes during fine-tuning with $r_{mos}$ as the reward function.
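
The Figure 1 caption describes a PPO clip loss combined with an MSE loss. The sketch below shows one way such a combined objective could be computed with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names, the clip range `eps`, the weighting factor `mse_weight`, and the use of a precomputed `advantage` derived from the reward are all hypothetical choices that the text above does not specify.

```python
import numpy as np


def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped surrogate loss (negated so that lower is better).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and the previous policy. advantage: a reward-derived signal
    (assumed precomputed here). eps: clip range (hypothetical default).
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))


def mse_loss(enhanced, clean):
    """Supervised mean-squared error between enhanced and clean signals."""
    return np.mean((enhanced - clean) ** 2)


def combined_loss(logp_new, logp_old, advantage, enhanced, clean,
                  eps=0.2, mse_weight=1.0):
    """Sum of the PPO clip loss and a weighted MSE loss.

    mse_weight is a hypothetical knob; the source does not state how the
    two terms are balanced.
    """
    policy_term = ppo_clip_loss(logp_new, logp_old, advantage, eps)
    supervised_term = mse_loss(enhanced, clean)
    return policy_term + mse_weight * supervised_term


if __name__ == "__main__":
    logp_old = np.array([-1.0, -2.0])
    logp_new = np.array([-0.9, -2.1])
    advantage = np.array([0.5, -0.3])
    enhanced = np.array([0.1, 0.2, 0.3])
    clean = np.array([0.0, 0.2, 0.4])
    print(combined_loss(logp_new, logp_old, advantage, enhanced, clean))
```

Clipping the probability ratio limits how far a single update can move the policy from its previous version, while the MSE term keeps the enhanced output anchored to the clean reference. This is consistent with the abstract's observation that both losses matter for balanced optimization.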