Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis

Yejin Jeon, Youngjae Kim, Jihyun Lee, Hyounghun Kim, Gary Geunbae Lee

TL;DR

This work introduces an end-to-end face-to-voice synthesis framework that jointly learns a fine-grained facial identity embedding and the corresponding voice by integrating progressively granular facial features into a VITS-based speech model. A bilateral attribute-based enhancement aligns gender and ethnicity cues across the visual and audio modalities, while multi-view data augmentation exposes the model to diverse visual conditions. Empirical results on LRS3 show improved speaker fidelity, face–voice congruence, and robustness to unseen speakers, with ablations validating the contributions of progressive feature extraction, attribute supervision, and data augmentation. The approach advances personalized speech synthesis from facial images and could support assistive communication tools; the work also addresses ethical considerations such as misuse risk and demographic sensitivity.

Abstract

For individuals who have experienced traumatic events such as strokes, speech may no longer be a viable means of communication. While text-to-speech (TTS) can serve as a communication aid because it generates synthetic speech, it fails to preserve the user's own voice. Face-to-voice (FTV) synthesis, which derives corresponding voices from facial images, therefore provides a promising alternative. However, existing methods rely on pre-trained visual encoders and fine-tune them to align with speech embeddings, which strips fine-grained information such as gender or ethnicity from facial inputs, despite its known correlation with vocal traits. Moreover, these pipelines are multi-stage and require separate training of multiple components, which makes training inefficient. To address these limitations, we utilize fine-grained facial attribute modeling by decomposing facial images into non-overlapping segments and progressively integrating them into a multi-granular representation. This representation is further refined through multi-task learning of speaker attributes such as gender and ethnicity in both the visual and acoustic domains. Moreover, to improve alignment robustness, we adopt a multi-view training strategy that pairs multiple visual perspectives of a speaker, differing in angle and lighting conditions, with identical speech recordings. Extensive subjective and objective evaluations confirm that our approach substantially enhances face-voice congruence and synthesis stability.
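
The progressive integration described above can be illustrated with a minimal sketch. It assumes a 4×4 grid of non-overlapping patches (16 in total) that are merged in 2×2 groups into four larger areas and then into a single vector `F`. The function names, the mean-pooling "encoder", and the averaging used to merge neighbours are illustrative assumptions only; the paper's actual encoder and aggregation operators are learned and are not reproduced here.

```python
import numpy as np


def split_into_patches(image, grid=4):
    """Split an (H, W, C) image into grid x grid non-overlapping patches.

    Returns an array of shape (grid, grid, H // grid, W // grid, C).
    """
    h, w, c = image.shape
    assert h % grid == 0 and w % grid == 0, "image must divide evenly"
    ph, pw = h // grid, w // grid
    return image.reshape(grid, ph, grid, pw, c).transpose(0, 2, 1, 3, 4)


def embed_patches(patches):
    """Hypothetical stand-in for a patch encoder: mean colour per patch."""
    return patches.mean(axis=(2, 3))  # (grid, grid, C)


def merge_2x2(features):
    """Average each 2x2 block of adjacent features: (g, g, d) -> (g/2, g/2, d)."""
    g, _, d = features.shape
    return features.reshape(g // 2, 2, g // 2, 2, d).mean(axis=(1, 3))


# Toy input; a real system would use an aligned face crop.
image = np.random.default_rng(0).random((64, 64, 3))

fine = embed_patches(split_into_patches(image))  # 16 smallest patches: (4, 4, 3)
mid = merge_2x2(fine)                            # four larger areas:   (2, 2, 3)
F = merge_2x2(mid)[0, 0]                         # final representation: (3,)
```

Because every merge here is a plain average over equally sized regions, `F` collapses to the global mean of the image. A learned merge at each level is what would let the coarser representations retain information that simple averaging discards.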

Paper Structure

This paper contains 18 sections, 4 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Diagram illustrating the proposed model. At a high level, an input image is segmented into 16 patches of the smallest size. Four sets of adjacent patches are then combined to create four larger areas. These four areas are subsequently aggregated into a final representation $F$, which ultimately integrates information from all 16 original patches. Attribute enhancement is additionally applied in both the facial and acoustic domains. Note that the architecture following the text encoder output is identical to the original VITS model.
  • Figure 2: Gender classification agreement (pink) and accuracy (blue) scores in percentage. Agreement is quantified by the number of annotator votes assigning the synthesized voice to a specific gender (male or female). Accuracy is determined based on the majority vote outcome relative to the ground truth gender.
  • Figure 4: ABX (blue) and SECS (red) scores for the out-of-domain GRID dataset.
  • Figure 5: Visualization of collective modality attribute loss.
  • Figure : (a) Pluster*
  • ...and 9 more figures