SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang

TL;DR

The paper addresses the evaluation of speaker voice similarity in voice conversion by leveraging pre-trained speech foundation models (SFMs). It introduces SVSNet+, which fuses layer-wise SFM representations through a weighted sum, aligns the inputs with a co-attention module, and predicts similarity with a distance-based predictor; the model is trained on VCC2018/2020 data. SVSNet+ with WavLM-Large, and with other SFMs, improves on the baseline in system-level linear correlation coefficient (LCC), Spearman's rank correlation coefficient (SRCC), and mean squared error (MSE). The learned weighted sum gives robust gains, whereas fine-tuning the SFM does not help. The work demonstrates the practical potential of incorporating SFMs into speaker similarity tasks and suggests further exploration of multi-SFM fusion for better generalization across datasets and languages.
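To make the system-level metrics concrete, the sketch below averages utterance-level scores per voice conversion system and then computes LCC, SRCC, and MSE over those per-system means. This is a minimal illustration under assumptions, not the authors' evaluation code: the function names and the toy data in the usage note are hypothetical, and per-system averaging is assumed as the meaning of "system-level".

```python
import numpy as np


def system_level_scores(system_ids, scores):
    """Average utterance-level scores per system.

    Returns the sorted unique system ids and their mean scores.
    """
    system_ids = np.asarray(system_ids)
    scores = np.asarray(scores, dtype=float)
    ids = np.unique(system_ids)
    means = np.array([scores[system_ids == s].mean() for s in ids])
    return ids, means


def average_ranks(x):
    """Ranks starting at 1; tied values share their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def lcc(a, b):
    """Linear (Pearson) correlation coefficient."""
    return float(np.corrcoef(a, b)[0, 1])


def srcc(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return lcc(average_ranks(a), average_ranks(b))


def mse(a, b):
    """Mean squared error between two score vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))
```

A usage example with made-up scores for three hypothetical systems:

```python
ids = ["A", "A", "B", "B", "C", "C"]
human = [1.0, 2.0, 2.0, 3.0, 4.0, 4.0]      # toy listener ratings
predicted = [1.5, 1.5, 3.0, 3.0, 3.5, 3.5]  # toy model outputs

_, human_sys = system_level_scores(ids, human)
_, pred_sys = system_level_scores(ids, predicted)
print(lcc(human_sys, pred_sys), srcc(human_sys, pred_sys), mse(human_sys, pred_sys))
```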

Abstract

Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations improves significantly over baseline models. In addition, while fine-tuning WavLM on the small downstream dataset does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM improves performance substantially. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.
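The weighted-sum representation mentioned in the abstract can be sketched as follows: the SFM's layer outputs stay fixed, and only one scalar weight per layer is learned and used to combine them. This is a hedged illustration, not the paper's implementation. The softmax normalization of the weights, the array shapes, and the names `weighted_sum` and `layer_logits` are assumptions made for the example.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = np.exp(logits - np.max(logits))
    return shifted / shifted.sum()


def weighted_sum(hidden_states, layer_logits):
    """Combine layer-wise representations into a single representation.

    hidden_states: array of shape (num_layers, num_frames, dim),
        one output per SFM layer (kept fixed, not fine-tuned).
    layer_logits: array of shape (num_layers,), the learnable scalars.
        Softmax normalization is an assumption of this sketch.
    Returns an array of shape (num_frames, dim).
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    weights = softmax(np.asarray(layer_logits, dtype=float))
    # Contract the layer axis: sum over l of weights[l] * hidden_states[l]
    return np.tensordot(weights, hidden_states, axes=1)
```

With all logits equal, the result is the plain average over layers. During training, the downstream loss would adjust the logits so that more useful layers receive larger weights, which leaves far fewer parameters to fit on a small dataset than fine-tuning the whole SFM.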

Paper Structure

This paper contains 20 sections, 5 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: (a) The architecture of SVSNet+. P, L, CAT, Dis, and Pred respectively denote the pre-trained model, linear layer, co-attention module, distance module, and prediction module. (b) The pre-trained model followed by the weighted sum module.