Hear Your Face: Face-based voice conversion with F0 estimation
Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee
TL;DR
The paper addresses face-based voice conversion by deriving the target's pitch cue, $f_{0,\,\text{avg}}$, from facial images and using it to guide a conditional variational autoencoder that transforms source speech to resemble the target's vocal identity without accessing target acoustic data. HYFace integrates a face-based speaker embedding, SSL-based content representations, and a frame-wise $F_0$ decoder, with an average-$F_0$ estimator trained on faces to handle unseen targets. The authors introduce a pitch-deviation metric to quantify explicit face–voice associations and demonstrate state-of-the-art results on the LRS3 dataset, achieving strong objective and subjective scores while maintaining close alignment to ground-truth pitch. This work advances face-centered speech synthesis, enabling identity-preserving voice generation for speakers lacking vocal data and providing a new avenue for cross-modal identity transfer.
Abstract
This paper delves into the emerging field of face-based voice conversion, leveraging the unique relationship between an individual's facial features and their vocal characteristics. We present a novel face-based voice conversion framework that particularly utilizes the average fundamental frequency of the target speaker, derived solely from their facial images. Through extensive analysis, our framework demonstrates superior speech generation quality and the ability to align facial features with voice characteristics, including tracking of the target speaker's fundamental frequency.
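To make the notion of an average fundamental frequency concrete, the following is a minimal sketch and not the authors' implementation. It assumes a frame-wise F0 contour in Hz in which unvoiced frames are marked with 0, and it measures the gap between an estimated and a reference average F0 in semitones. The function names `average_f0` and `pitch_deviation_semitones` are hypothetical, and the semitone scale is an assumption; the exact definition of the paper's pitch-deviation metric is not given in this section.

```python
import numpy as np


def average_f0(f0_hz):
    """Mean F0 in Hz over voiced frames (assumed convention: F0 > 0 means voiced)."""
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0[f0 > 0]
    if voiced.size == 0:
        raise ValueError("contour contains no voiced frames")
    return float(voiced.mean())


def pitch_deviation_semitones(f0_avg_pred, f0_avg_true):
    """Absolute gap between two average F0 values, in semitones.

    Illustrative only: the paper's own metric may be defined differently.
    """
    if f0_avg_pred <= 0 or f0_avg_true <= 0:
        raise ValueError("average F0 values must be positive")
    return float(abs(12.0 * np.log2(f0_avg_pred / f0_avg_true)))
```

For example, `average_f0([0, 100, 200, 0])` averages only the two voiced frames, and a predicted average one octave above the reference corresponds to a deviation of 12 semitones.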
