Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

Yan Rong, Li Liu

TL;DR

This work proposes an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion.

Abstract

Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) difficulty obtaining facial embeddings that are well aligned with the speaker's voice identity information, and (2) inadequate decoupling of content and speaker identity information in the audio input. To address these issues, we present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC). More precisely, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. In addition, unlike prior work, our method accepts either audio or text inputs, offering controllable speech generation with adjustable emotional tone and speed. Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity. The project website, with audio samples and code, can be found at https://id-facevc.github.io.
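
To make the idea of contrastively aligning face and voice embeddings concrete, the sketch below computes a generic InfoNCE-style loss over matched face/voice pairs. It is an illustration only and not the authors' IAQ-CL module: the function names `l2_normalize` and `info_nce`, the `temperature` value, and the use of plain Python lists as embeddings are all assumptions made for this example, and the identity-aware query mechanism is not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(face_embs, voice_embs, temperature=0.1):
    """Mean InfoNCE loss; face_embs[i] and voice_embs[i] are assumed to be a matched pair.

    Hypothetical helper for illustration, not the loss defined in the paper.
    """
    faces = [l2_normalize(f) for f in face_embs]
    voices = [l2_normalize(v) for v in voice_embs]
    total = 0.0
    for i, face in enumerate(faces):
        # Similarity of this face to every voice in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(face, voice)) / temperature
                  for voice in voices]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the matched voice.
        total += log_denom - logits[i]
    return total / len(faces)


aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, the loss is lower when each face is paired with its own voice than when the pairs are swapped, which is the behavior a contrastive objective of this kind rewards.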

Paper Structure (32 sections, 8 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: (a) Traditional voice conversion (VC) paradigm. (b) Our novel ZS-FVC paradigm, which accepts either audio or text as input and allows control over the emotional tone and speed of the generated speech.
  • Figure 2: Overview of the proposed ID-FaceVC. The Adapter is a Feed Forward Network used to adjust vector dimensions. The embeddings $F_{f}$, $F_{spk}^{\prime}$, $F_{con}^{\prime}$ correspond to the face, speaker, and content features extracted by $E_f$, $E_{spk}$, $E_{con}$, respectively.
  • Figure 3: The inference stage of ID-FaceVC. Text is introduced as an alternative modality to produce natural, rhythmic, and controllable speech.
  • Figure 4: Mel-spectrogram visualizations of voices generated from text inputs under different emotions and speeds. The red boxes highlight areas with significant changes compared with the "calm" state.
  • Figure 5: The mel-spectrograms of voices generated by mixed facial embeddings. From left to right, as the weight of male facial embeddings increases, the voice characteristics gradually shift from female to male, and the fundamental frequency decreases.
  • ...and 4 more figures
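
The mixed facial embeddings described in the Figure 5 caption can be pictured as a weighted blend of two embedding vectors. The sketch below assumes a simple convex combination; the caption does not state the exact mixing rule, so the formula, the function name `mix_embeddings`, and the parameter `male_weight` are hypothetical.

```python
def mix_embeddings(female_emb, male_emb, male_weight):
    """Blend two equal-length embeddings; male_weight in [0, 1] is the male share.

    Assumed convex combination for illustration, not confirmed by the paper.
    """
    if len(female_emb) != len(male_emb):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= male_weight <= 1.0:
        raise ValueError("male_weight must lie in [0, 1]")
    return [(1.0 - male_weight) * f + male_weight * m
            for f, m in zip(female_emb, male_emb)]


# Sweep the weight from fully female (0.0) to fully male (1.0).
sweep = [mix_embeddings([1.0, 0.0], [0.0, 1.0], w) for w in (0.0, 0.25, 1.0)]
```

Under this assumption, a weight of 0.0 returns the female embedding unchanged and a weight of 1.0 returns the male embedding, with intermediate weights moving linearly between the two.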