Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis

Yejin Jeon; Youngjae Kim; Jihyun Lee; Hyounghun Kim; Gary Geunbae Lee

TL;DR

This work introduces an end-to-end face-to-voice synthesis framework that jointly learns a fine-grained facial identity embedding and the corresponding voice by integrating progressively granular facial features into a VITS-based speech model. A bilateral attribute-based enhancement aligns gender and ethnicity cues across the visual and audio modalities, while multi-view data augmentation exposes the model to diverse visual conditions. Empirical results on LRS3 show improved speaker fidelity, face–voice congruence, and robustness to unseen speakers, and ablations validate the contributions of progressive feature extraction, attribute supervision, and data augmentation. The approach advances personalized speech synthesis from facial images, with potential use in assistive communication tools, while addressing ethical considerations such as misuse risk and demographic sensitivity.

Abstract

For individuals who have experienced traumatic events such as strokes, speech may no longer be a viable means of communication. Text-to-speech (TTS) can serve as a communication aid because it generates synthetic speech, but it fails to preserve the user's own voice. Face-to-voice (FTV) synthesis, which derives a corresponding voice from a facial image, is therefore a promising alternative. However, existing methods rely on pre-trained visual encoders and fine-tune them to align with speech embeddings, a process that strips fine-grained information such as gender or ethnicity from the facial input, despite the known correlation of these attributes with vocal traits. Moreover, these pipelines are multi-stage and require separate training of multiple components, which makes training inefficient. To address these limitations, we model fine-grained facial attributes by decomposing facial images into non-overlapping segments and progressively integrating them into a multi-granular representation. This representation is further refined through multi-task learning of speaker attributes such as gender and ethnicity in both the visual and acoustic domains. In addition, to improve alignment robustness, we adopt a multi-view training strategy that pairs multiple visual perspectives of a speaker, varying in angle and lighting conditions, with the same speech recording. Extensive subjective and objective evaluations confirm that our approach substantially enhances face-voice congruence and synthesis stability.
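
As a rough illustration of splitting an image into non-overlapping segments and progressively aggregating them into coarser levels, the following minimal Python sketch may help. It is not the authors' architecture: the function names, the patch size, the use of a patch mean as a stand-in "feature", and the 2x2 averaging merge rule are all assumptions chosen for illustration only.

```python
# Illustrative sketch only. Patch size, the mean "feature", and the 2x2
# averaging merge are assumptions, not the method described in the paper.

def split_into_patches(image, patch):
    """Split a 2D grid (list of rows) into non-overlapping patch x patch blocks."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = []
    for top in range(0, h, patch):
        row = []
        for left in range(0, w, patch):
            block = [image[y][left:left + patch] for y in range(top, top + patch)]
            row.append(block)
        grid.append(row)
    return grid


def patch_mean(block):
    """Toy per-patch feature: the mean pixel value of the block."""
    values = [v for r in block for v in r]
    return sum(values) / len(values)


def merge_2x2(features):
    """Average each 2x2 neighbourhood of features to form a coarser grid."""
    h, w = len(features), len(features[0])
    return [
        [
            (features[y][x] + features[y][x + 1]
             + features[y + 1][x] + features[y + 1][x + 1]) / 4.0
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]


def progressive_levels(image, patch):
    """Return feature grids from finest to coarsest granularity."""
    grid = split_into_patches(image, patch)
    level = [[patch_mean(b) for b in row] for row in grid]
    levels = [level]
    while len(level) >= 2 and len(level[0]) >= 2:
        level = merge_2x2(level)
        levels.append(level)
    return levels


if __name__ == "__main__":
    # Hypothetical 8x8 grayscale "image" with 2x2 patches.
    image = [[y * 8 + x for x in range(8)] for y in range(8)]
    for lvl in progressive_levels(image, 2):
        print(len(lvl), "x", len(lvl[0]))
```

In a learned model, the per-patch mean would be replaced by an embedding and the averaging step by a trainable merge; the sketch only shows how the granularity hierarchy is formed.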
