From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech

Ji-Hoon Kim, Jeongsoo Choi, Jaehun Kim, Chaeyoung Jung, Joon Son Chung

TL;DR

This work tackles video-to-speech synthesis by bridging the modality gap between silent talking-face video and audible speech. It introduces a hierarchical visual encoder that maps video to acoustic spaces via content, timbre, and prosody, conditionally guiding a flow-matching decoder to synthesize mel-spectrograms, followed by a neural vocoder. The approach leverages AV-HuBERT and ArcFace representations and employs Optimal Transport Conditional Flow Matching with classifier-free guidance to achieve high fidelity with few sampling steps. On LRS3-TED and LRS2-BBC, the method achieves state-of-the-art naturalness, intelligibility, and speaker similarity, with MOS gaps to real speech as small as 0.05, indicating strong practical potential for re-dubbing, assistive communication, and other silent-video applications.

Abstract

The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align a visual factor -- lip movements, face identity, or facial expressions -- with its corresponding acoustic counterpart to ensure a seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
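The sampling side of a flow matching decoder with classifier-free guidance can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fixed-step Euler solver, the guidance formula `v_uncond + scale * (v_cond - v_uncond)`, and all function names (`guided_velocity`, `euler_sample`, `toy_velocity`) are hypothetical choices made for this example. In particular, `toy_velocity` is a hand-written stand-in for the trained network, and a flat list of floats stands in for a mel-spectrogram.

```python
import random


def guided_velocity(v_fn, x, t, cond, scale):
    """Blend conditional and unconditional velocities (classifier-free guidance).

    Assumed form: v = v_uncond + scale * (v_cond - v_uncond).
    scale = 1.0 recovers the purely conditional velocity.
    """
    v_cond = v_fn(x, t, cond)
    v_uncond = v_fn(x, t, None)  # None marks a dropped condition
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(v_fn, x0, cond, n_steps=10, scale=1.0):
    """Integrate dx/dt = v(x, t) from t = 0 (prior noise) to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # t stays strictly below 1 inside the loop
        v = guided_velocity(v_fn, x, t, cond, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a trained vector field.

    Points straight at a fixed target (the condition itself, or zeros when the
    condition is dropped), which yields straight-line trajectories.
    """
    target = [0.0] * len(x) if cond is None else cond
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(8)]  # sample from the simple prior
cond = [1.0] * 8                                    # toy conditioning vector
sample = euler_sample(toy_velocity, noise, cond, n_steps=5, scale=1.0)
```

Because the toy field produces straight trajectories, a handful of Euler steps is enough to reach the target here. A learned field is only approximately straight, so the number of steps needed in practice depends on the model.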

Paper Structure

This paper contains 33 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An overview of the proposed system. Our method learns hierarchical representations from video to speech, focusing on three key factors: lips, face IDs, and facial expressions. The visual encoding is converted into the corresponding speech through an effective flow matching decoder and neural vocoder.
  • Figure 2: The detailed architecture of our framework. Our approach gradually closes the substantial modality gap between video and speech, while aligning key visual cues--lip movements, face identity, and facial expressions--with their corresponding speech attributes--content, timbre, and prosody. The flow matching decoder effectively estimates the mel-spectrogram distribution, conditioned on the visual encoding $\boldsymbol{\mu}$. ${\bf x}_t$ represents an intermediate state of the mel-spectrogram at time-step $t$, and $\phi_t^{OT}$ denotes the corresponding flow.
  • Figure 3: Speech attribute prediction pipeline. The content and prosody predictors incorporate an auxiliary masked convolution block to enrich contextual information.
  • Figure 4: Mel-spectrogram visualization compared to Ground Truth (GT) speech. As highlighted in the red boxes, the proposed method captures an accurate and dynamic fundamental frequency, along with changes synchronized with facial expressions.
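
For readers unfamiliar with the notation in Figure 2, the standard Optimal Transport Conditional Flow Matching formulation is reproduced below. It is assumed, not confirmed by this page, that the paper's ${\bf x}_t$ and $\phi_t^{OT}$ follow this common definition. The symbols ${\bf x}_0$ (prior noise sample), ${\bf x}_1$ (target mel-spectrogram), $\sigma_{\min}$ (small noise floor), and $v_\theta$ (learned vector field) are introduced here for illustration.

$$
{\bf x}_t = \phi_t^{OT}({\bf x}_0) = \bigl(1 - (1 - \sigma_{\min})\,t\bigr)\,{\bf x}_0 + t\,{\bf x}_1, \qquad t \in [0, 1]
$$

$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\,{\bf x}_0,\,{\bf x}_1} \Bigl\| v_\theta\bigl({\bf x}_t, t, \boldsymbol{\mu}\bigr) - \bigl({\bf x}_1 - (1 - \sigma_{\min})\,{\bf x}_0\bigr) \Bigr\|^2
$$

The regression target ${\bf x}_1 - (1 - \sigma_{\min})\,{\bf x}_0$ is the time derivative of $\phi_t^{OT}({\bf x}_0)$, which is constant in $t$. Under this formulation each conditional trajectory is a straight line, which is the usual explanation for why such models tolerate few sampling steps.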