Hear Your Face: Face-based voice conversion with F0 estimation

Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee

TL;DR

The paper addresses face-based voice conversion by deriving the target's pitch cue, $f_{0,\,\text{avg}}$, from facial images and using it to guide a conditional variational autoencoder that transforms source speech to resemble the target's vocal identity without accessing target acoustic data. HYFace integrates a face-based speaker embedding, SSL-based content representations, and a frame-wise $F0$ decoder, with an average-$F0$ estimator trained on faces to handle unseen targets. The authors introduce a pitch-deviation metric to quantify explicit face–voice associations and demonstrate state-of-the-art results on the LRS3 dataset, achieving strong objective and subjective scores while maintaining close alignment to ground-truth pitch. This work advances face-centered speech synthesis, enabling identity-preserving voice generation for speakers lacking vocal data and providing a new avenue for cross-modal identity transfer.

Abstract

This paper delves into the emerging field of face-based voice conversion, leveraging the unique relationship between an individual's facial features and their vocal characteristics. We present a novel face-based voice conversion framework that particularly utilizes the average fundamental frequency of the target speaker, derived solely from their facial images. Through extensive analysis, our framework demonstrates superior speech generation quality and the ability to align facial features with voice characteristics, including tracking of the target speaker's fundamental frequency.

Paper Structure (17 sections, 2 equations, 1 figure, 2 tables)

This paper contains 17 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overview of the proposed method, HYFace, a conditional VAE-based network whose speaker embedding is learned from face images only. In the training phase, a predefined speaker-wise average $F0$ ($f_{0,\mathit{gt}}^{\mathit{avg}}$) is used to estimate frame-wise $F0$ values. However, because the $f_{0,\mathit{gt}}^{\mathit{avg}}$ values for unseen target speakers are not available during the inference phase, we independently train an average-$F0$ estimation network based solely on facial inputs. This module is then used in the inference phase.
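
The split between training and inference described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `speaker_average_f0` and `average_f0_condition`, the convention that unvoiced frames are marked with 0.0, and the `face_f0_estimator` callable are all assumptions made for illustration. The sketch only shows where the ground-truth speaker-wise average $F0$ would plausibly be computed and when the face-based estimate takes its place.

```python
from statistics import fmean
from typing import Callable, Optional, Sequence


def speaker_average_f0(utterances: Sequence[Sequence[float]]) -> float:
    """Mean F0 (Hz) over all voiced frames of one speaker.

    Assumption: unvoiced frames are marked with 0.0 and are excluded.
    """
    voiced = [f0 for utt in utterances for f0 in utt if f0 > 0.0]
    if not voiced:
        raise ValueError("no voiced frames available for this speaker")
    return fmean(voiced)


def average_f0_condition(
    training: bool,
    gt_average_f0: Optional[float],
    face_features: Sequence[float],
    face_f0_estimator: Callable[[Sequence[float]], float],
) -> float:
    """Choose the average-F0 value passed to the frame-wise F0 decoder.

    Training: use the precomputed ground-truth speaker average.
    Inference: the target's speech is unavailable, so fall back to the
    (hypothetical) face-based estimator.
    """
    if training:
        if gt_average_f0 is None:
            raise ValueError("training requires a ground-truth average F0")
        return gt_average_f0
    return face_f0_estimator(face_features)


# Usage: two utterances of one speaker, with unvoiced frames set to 0.0.
frames = [[0.0, 200.0, 220.0], [210.0, 0.0]]
print(speaker_average_f0(frames))  # 210.0
```

Keeping the choice of conditioning value in one function makes the training/inference asymmetry explicit: the decoder always receives a single scalar per speaker, and only the source of that scalar changes.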