MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction

Wangjin Zhou, Zhengdong Yang, Chenhui Chu, Sheng Li, Raj Dabre, Yi Zhao, Tatsuya Kawahara

TL;DR

This work addresses fake audio detection (FAD) by using automatic MOS prediction as an auxiliary quality signal. It introduces MOS-FAD, a three-part framework combining FAD predictors built on self-supervised learning (SSL) models (SSL-FAD), MOS predictors (Fused SSL-MOS), and a MOS-informed fusion mechanism (MOS-FAD Fusion) that gates FAD information with MOS scores; a MOS-based data filter further improves training by balancing real and fake samples. On ASVspoof benchmarks, MOS-FAD achieves state-of-the-art results, with gating-based fusion yielding a relative reduction in equal error rate (EER) of up to 13.6%. The approach demonstrates that a fine-grained, MOS-informed perspective enhances robustness against synthetic speech and can help mitigate misuse of voice synthesis technologies.
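
To make the idea of gating FAD information with MOS scores concrete, the following is a minimal sketch, not the authors' implementation. It assumes each FAD model emits one scalar score per utterance and that gate weights come from a softmax over a linear function of the predicted MOS. The function names, the `weights` and `biases` parameters, and the 1-to-5 MOS normalization are all hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(fad_scores, mos, weights, biases):
    """Fuse per-model FAD scores with gate weights derived from a predicted MOS.

    fad_scores: one score per FAD model for a single utterance.
    mos: predicted MOS for that utterance (assumed to lie in 1..5).
    weights, biases: hypothetical gate parameters, one pair per FAD model.
    """
    mos_norm = (mos - 1.0) / 4.0  # map the assumed 1..5 range onto 0..1
    gates = softmax([w * mos_norm + b for w, b in zip(weights, biases)])
    # Weighted sum of the FAD scores; the gates sum to 1.
    return sum(g * s for g, s in zip(gates, fad_scores))


# Usage: two FAD models; the gate favours the first model for high-MOS audio.
fused = gated_fusion([0.2, 0.8], mos=4.5, weights=[4.0, -4.0], biases=[0.0, 0.0])
```

In a trained system the gate parameters would be learned jointly with, or on top of, the FAD predictors; here they are fixed by hand only to show the data flow.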

Abstract

Automatic Mean Opinion Score (MOS) prediction is employed to evaluate the quality of synthetic speech. This study extends the application of predicted MOS to the task of Fake Audio Detection (FAD), as we expect that MOS can be used to assess how close synthesized speech is to the natural human voice. We propose MOS-FAD, in which MOS is leveraged at two key points in FAD: training data selection and model fusion. In training data selection, we demonstrate that MOS enables effective filtering of samples from unbalanced datasets. In model fusion, our results demonstrate that incorporating MOS as a gating mechanism enhances overall performance.
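
The training data selection step can be illustrated with a small sketch. The abstract does not state the selection criterion, so the rule below is an assumption: keep every real sample and retain only the fake samples with the highest predicted MOS until the classes are balanced. The `label` and `mos` field names and the `fake_per_real` parameter are hypothetical.

```python
def balance_by_mos(samples, fake_per_real=1.0):
    """Return a class-balanced subset of `samples` chosen by predicted MOS.

    samples: list of dicts with a "label" ("real" or "fake") and a
             predicted "mos" value (hypothetical schema).
    fake_per_real: target number of fake samples per real sample.
    """
    real = [s for s in samples if s["label"] == "real"]
    fake = [s for s in samples if s["label"] == "fake"]
    # Assumed criterion: prefer fake samples that sound most natural.
    fake_sorted = sorted(fake, key=lambda s: s["mos"], reverse=True)
    keep = min(len(fake_sorted), int(len(real) * fake_per_real))
    return real + fake_sorted[:keep]


# Usage: an unbalanced toy set with two real and five fake utterances.
toy = [
    {"id": "r1", "label": "real", "mos": 4.4},
    {"id": "r2", "label": "real", "mos": 4.1},
    {"id": "f1", "label": "fake", "mos": 2.0},
    {"id": "f2", "label": "fake", "mos": 3.9},
    {"id": "f3", "label": "fake", "mos": 1.5},
    {"id": "f4", "label": "fake", "mos": 4.2},
    {"id": "f5", "label": "fake", "mos": 3.0},
]
selected = balance_by_mos(toy)
```

Other criteria, such as keeping fake samples within a MOS band, would fit the same interface; which rule works best is an empirical question the sketch does not answer.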

Paper Structure (17 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Proposed model structure.
  • Figure 2: Training process of our proposed model.
  • Figure 3: The MOS score distributions in ASVspoof2019 LA track.