Hear Your Face: Face-based voice conversion with F0 estimation

Jaejun Lee; Yoori Oh; Injune Hwang; Kyogu Lee

TL;DR

The paper addresses face-based voice conversion by deriving the target's pitch cue, $f_{0,\,\text{avg}}$, from facial images and using it to guide a conditional variational autoencoder that transforms source speech to resemble the target's vocal identity without accessing target acoustic data. HYFace integrates a face-based speaker embedding, SSL-based content representations, and a frame-wise $F0$ decoder, with an average-$F0$ estimator trained on faces to handle unseen targets. The authors introduce a pitch-deviation metric to quantify explicit face–voice associations and demonstrate state-of-the-art results on the LRS3 dataset, achieving strong objective and subjective scores while maintaining close alignment to ground-truth pitch. This work advances face-centered speech synthesis, enabling identity-preserving voice generation for speakers lacking vocal data and providing a new avenue for cross-modal identity transfer.

Abstract

This paper delves into the emerging field of face-based voice conversion, leveraging the unique relationship between an individual's facial features and their vocal characteristics. We present a novel face-based voice conversion framework that particularly utilizes the average fundamental frequency of the target speaker, derived solely from their facial images. Through extensive analysis, our framework demonstrates superior speech generation quality and the ability to align facial features with voice characteristics, including tracking of the target speaker's fundamental frequency.

Paper Structure (17 sections, 2 equations, 1 figure, 2 tables)

Introduction
Related work
Voice conversion
Face-voice association
Methods
HYFace
Model architecture
Experiments
Dataset
Comparison systems
Metrics
Results
Objective results
Subjective results
Pitch deviations
...and 2 more sections

Figures (1)

Figure 1: Overview of the proposed method, HYFace, a conditional VAE-based network whose speaker embedding is learned from face images only. In the training phase, a predefined speaker-wise average $F0$ ($f_{0,\mathit{gt}}^{\mathit{avg}}$) is used to estimate frame-wise $F0$ values. However, because the $f_{0,\mathit{gt}}^{\mathit{avg}}$ values for unseen target speakers are not available during the inference phase, we independently train an average-$F0$ estimation network based solely on facial inputs. This module is then used in the inference phase.
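The split between training and inference described in the caption can be illustrated with a small sketch. It is not the authors' implementation: this page does not say how the speaker-wise average $F0$ is computed, so the sketch assumes a plain mean over voiced frames, with unvoiced frames marked as 0.0. The names `speaker_average_f0` and `conditioning_f0` are hypothetical.

```python
from statistics import mean


def speaker_average_f0(utterance_f0_tracks):
    """Mean F0 (Hz) over all voiced frames of one speaker.

    Assumption (not from the paper): unvoiced frames are stored as 0.0
    and are excluded from the average.
    """
    voiced = [f for track in utterance_f0_tracks for f in track if f > 0.0]
    if not voiced:
        raise ValueError("no voiced frames for this speaker")
    return mean(voiced)


def conditioning_f0(phase, f0_avg_gt=None, f0_avg_face=None):
    """Pick the average-F0 value that conditions the frame-wise F0 decoder.

    Training uses the ground-truth speaker average; inference uses the
    value predicted from the face, since no target audio is available.
    """
    value = f0_avg_gt if phase == "train" else f0_avg_face
    if value is None:
        raise ValueError(f"missing average F0 for phase {phase!r}")
    return value


# Two hypothetical utterances from one speaker (0.0 = unvoiced frame).
tracks = [[0.0, 100.0, 200.0], [0.0, 300.0]]
gt_avg = speaker_average_f0(tracks)
train_value = conditioning_f0("train", f0_avg_gt=gt_avg)
infer_value = conditioning_f0("inference", f0_avg_face=195.0)
```

In this sketch the face-predicted value (195.0) is a made-up number standing in for the output of the separately trained average-$F0$ estimator.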
