Emotional Face-to-Speech

Jiaxin Ye, Boyuan Cao, Hongming Shan

TL;DR

This paper introduces Emotional Face-to-Speech (eF2S) and the DEmoFace framework, which synthesize emotionally expressive speech directly from facial cues while preserving speaker identity. It advances discrete speech generation by combining residual vector quantization (RVQ) with a discrete diffusion model and a multimodal diffusion transformer (MM-DiT), coupled with curriculum learning and an enhanced predictor-free guidance mechanism for multi-conditional generation. The approach jointly models identity and emotion from face inputs and text content, enabling coherent, natural-sounding speech without vocal prompts. Experimental results on multiple face-speech datasets demonstrate superior naturalness, emotion alignment, and identity consistency relative to state-of-the-art baselines, with strong qualitative and ablation evidence supporting the method’s design choices.

Abstract

How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. In this paper, we explore a new task, termed emotional face-to-speech, aiming to synthesize emotional speech directly from expressive facial cues. To that end, we introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec. Specifically, we propose multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles based on facial emotion and identity. To enhance training efficiency and generation quality, we further introduce a coarse-to-fine curriculum learning algorithm for multi-level token processing. In addition, we develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively. Extensive experimental results demonstrate that DEmoFace generates more natural and consistent speech compared to baselines, even surpassing speech-driven methods. Demos are shown at https://demoface-ai.github.io/.
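The "multi-level neural audio codec" mentioned above is based on residual vector quantization (RVQ), according to the TL;DR. As a rough illustration of what multi-level tokens are, the sketch below quantizes a single vector with a stack of codebooks, where each level encodes the residual left by the previous levels. The function names, the random codebooks, and the sizes are assumptions made for illustration; this is not the codec used by DEmoFace.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize x level by level; each level encodes the remaining residual.

    codebooks: list of arrays, each of shape (codebook_size, dim).
    Returns the chosen index per level and the final unexplained residual.
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    indices = []
    for codebook in codebooks:
        # Pick the codeword closest to the current residual.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices, residual


def rvq_decode(indices, codebooks, num_levels=None):
    """Sum the selected codewords of the first num_levels levels."""
    levels = len(indices) if num_levels is None else num_levels
    out = np.zeros(codebooks[0].shape[1], dtype=np.float64)
    for level in range(levels):
        out += codebooks[level][indices[level]]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical setup: 12 levels, 16 codewords each, 4-dimensional vectors.
    codebooks = [rng.normal(size=(16, 4)) for _ in range(12)]
    x = rng.normal(size=4)
    indices, residual = rvq_encode(x, codebooks)
    print("token per level:", indices)
    print("final residual norm:", np.linalg.norm(residual))
```

Each frame of speech thus becomes one token per level, which is why a generator operating on such a codec has to predict several token streams rather than one.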

Paper Structure

This paper contains 61 sections, 22 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Task comparison. (a) Conventional Face-to-Speech (F2S). (b) The introduced Emotional Face-to-Speech (eF2S). Given text and face prompts, the model is expected to generate speech that aligns with both the facial identity and the emotional expression. Our eF2S offers a novel perspective for generating consistent speech without relying on any vocal cues.
  • Figure 2: Overall framework of DEmoFace. The MM-DiT takes masked tokens $x_t^{r_1:r_{12}}$, time $t$, and condition set $\bm{c}$ as input to synthesize speech; it consists of $N$ blocks for conditioning and $12$ linear heads that predict concrete scores. During training, we propose a curriculum learning scheme that first inputs low-level tokens and then progressively adds high-level tokens to refine them. During sampling, an Euler sampler with our enhanced predictor-free guidance (EPFG) refines the tokens, and a codec decoder reconstructs the waveform.
  • Figure 3: Qualitative speech results. The red rectangles highlight key regions with acoustic differences or over-smoothing issues, and the red dotted circle marks F0 contours similar to the ground truth. Zoom in for more details.
  • Figure 4: t-SNE visualization of x-vectors from synthesized speech. Each color represents a different speaker.
  • Figure 5: Ablation study on curriculum learning. (a) Feature distribution across RVQ levels, with low-level features showing low-frequency patterns. (b)-(d) Comparison against the baseline without curriculum learning on three validation-set metrics as the number of training epochs varies. The effect is evident for WER and EmoSim but slight for SpkSim.
  • ...and 3 more figures
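
The Figure 2 caption describes a curriculum that starts from low-level tokens and progressively adds high-level ones. One simple way to picture such a schedule is sketched below. The linear ramp, the function names, and the parameters `ramp_steps` and `start_levels` are hypothetical choices for illustration; the summary does not specify the schedule the authors actually use.

```python
def active_levels(step, ramp_steps, num_levels=12, start_levels=1):
    """Number of RVQ levels exposed to the model at a given training step.

    Hypothetical linear ramp: begin with start_levels, reach num_levels
    after ramp_steps steps, then keep all levels active.
    """
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    if not 1 <= start_levels <= num_levels:
        raise ValueError("start_levels must lie in [1, num_levels]")
    step = max(step, 0)
    added = (step * (num_levels - start_levels)) // ramp_steps
    return min(num_levels, start_levels + added)


def level_mask(step, ramp_steps, num_levels=12, start_levels=1):
    """Boolean mask over levels: True for levels the model currently trains on."""
    k = active_levels(step, ramp_steps, num_levels, start_levels)
    return [level < k for level in range(num_levels)]


if __name__ == "__main__":
    # Print how the set of active levels grows over a 1100-step ramp.
    for step in (0, 100, 550, 1100, 2000):
        print(step, active_levels(step, ramp_steps=1100))
```

In a training loop, a mask like this could be used to restrict the loss to the currently active levels, so that coarse tokens are learned before fine ones.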