Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

Ji-Hoon Kim, Jaehun Kim, Joon Son Chung

TL;DR

This work tackles lip-to-speech synthesis by addressing the inherent one-to-many mapping caused by homophenes and varied speech styles. It introduces a three-component system: a video encoder, a variance decoder that injects SSL-based linguistic cues and acoustic variances (pitch and energy), and a flow-based post-net that refines mel-spectrogram details, followed by a neural vocoder that produces the waveform. By using intermediate HuBERT representations as linguistic conditioning and modeling pitch and energy alongside the flow-based refinement, the method achieves near-human speech quality and outperforms prior lip-to-speech approaches in naturalness and intelligibility on the GRID and Lip2Wav datasets. The results show robust performance in both constrained and unconstrained settings, and ablations confirm that linguistic conditioning, acoustic variance, and post-net refinement each contribute. Potential applications include redubbing silent media and assisting people with speech impairments; the work also acknowledges misuse and privacy concerns.

Abstract

The goal of this work is to reconstruct high quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, which results in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes, and (2) acoustic variance information to model diverse speech styles. Additionally, to further mitigate this problem, we employ a flow-based post-net which captures and refines the details of the generated speech. We perform extensive experiments on two datasets and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existing methods in terms of speech naturalness and intelligibility by a large margin. Synthesised samples are available at our demo page: https://mm.kaist.ac.kr/projects/LTBS.
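
To illustrate what "flow-based" means for a post-net, the sketch below implements a single conditional affine coupling layer in NumPy. It is a generic normalizing-flow building block, not the architecture used in the paper: the abstract does not specify the flow design, so the coupling form, the tiny conditioner network, and all sizes (80 mel bins, 32 condition channels, 16 hidden units) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_coupling_params(n_feat, n_cond, hidden=16):
    """Random weights for a tiny conditioner network (hypothetical sizes)."""
    half = n_feat // 2
    return {
        "w1": rng.normal(scale=0.1, size=(half + n_cond, hidden)),
        "w2": rng.normal(scale=0.1, size=(hidden, 2 * (n_feat - half))),
    }

def _scale_shift(params, x_a, cond):
    """Predict a bounded log-scale and a shift from the untouched half and cond."""
    h = np.tanh(np.concatenate([x_a, cond], axis=-1) @ params["w1"])
    log_s, t = np.split(h @ params["w2"], 2, axis=-1)
    return np.tanh(log_s), t

def coupling_forward(params, x, cond):
    """Map data x to latent z; also return the per-frame log-determinant."""
    half = x.shape[-1] // 2
    x_a, x_b = x[..., :half], x[..., half:]
    log_s, t = _scale_shift(params, x_a, cond)
    z_b = x_b * np.exp(log_s) + t
    return np.concatenate([x_a, z_b], axis=-1), log_s.sum(axis=-1)

def coupling_inverse(params, z, cond):
    """Exactly invert coupling_forward given the same condition."""
    half = z.shape[-1] // 2
    z_a, z_b = z[..., :half], z[..., half:]
    log_s, t = _scale_shift(params, z_a, cond)
    x_b = (z_b - t) * np.exp(-log_s)
    return np.concatenate([z_a, x_b], axis=-1)

# Usage with stand-in data: 50 frames, 80 mel bins, 32 condition channels.
mel = rng.normal(size=(50, 80))
cond = rng.normal(size=(50, 32))
params = make_coupling_params(n_feat=80, n_cond=32)

z, log_det = coupling_forward(params, mel, cond)
recon = coupling_inverse(params, z, cond)
assert np.allclose(recon, mel)  # the layer is invertible
```

Because the layer is exactly invertible and its log-determinant is cheap to compute, a stack of such layers can be trained by maximum likelihood on real mel-spectrograms and then run in the inverse direction to sample detailed spectrograms. This is the general property that makes flows a plausible remedy for over-smoothed outputs.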

Paper Structure (30 sections, 7 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: In subfigures (a) and (b), $\boldsymbol{e}_{spk}$ is a speaker embedding. In subfigures (a) and (c), $\boldsymbol{h}_v$ denotes the encoded video feature. In subfigure (c), Emb.T. refers to an embedding table. In subfigure (d), the dotted paths are used only during training. $\boldsymbol{y}_{mel}$ and $\hat{\boldsymbol{y}}_{mel}$ refer to the ground-truth and predicted mel-spectrograms, respectively. $cond$ denotes the post-net conditions, which comprise the input and output of the conformer decoder and $\boldsymbol{e}_{spk}$. In our experiments, we set $N=8$.
  • Figure 2: Visualisation of mel-spectrograms. Note that the proposed method captures fine details with frequency correlations better than the other methods ((c)-(e)), particularly in the red boxes.
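
The Figure 1 caption states that $cond$ contains the input and output of the conformer decoder together with $\boldsymbol{e}_{spk}$. One plausible way to assemble such a condition is sketched below. The caption does not say how the three parts are combined, so channel-wise concatenation with the speaker embedding repeated over time is an assumption, as are the function name `build_postnet_cond` and every dimension used.

```python
import numpy as np

def build_postnet_cond(dec_in, dec_out, e_spk):
    """Concatenate frame-level decoder features with a per-utterance speaker embedding.

    dec_in:  (T, D_in)  decoder input features
    dec_out: (T, D_out) decoder output features
    e_spk:   (D_spk,)   speaker embedding, repeated for every frame
    returns: (T, D_in + D_out + D_spk)
    """
    if dec_in.shape[0] != dec_out.shape[0]:
        raise ValueError("decoder input and output must have the same number of frames")
    n_frames = dec_in.shape[0]
    spk = np.broadcast_to(e_spk, (n_frames, e_spk.shape[-1]))
    return np.concatenate([dec_in, dec_out, spk], axis=-1)

# Usage with hypothetical sizes: 50 frames, 256-dim input, 80-dim output, 64-dim speaker.
rng = np.random.default_rng(0)
dec_in = rng.normal(size=(50, 256))
dec_out = rng.normal(size=(50, 80))
e_spk = rng.normal(size=(64,))

cond = build_postnet_cond(dec_in, dec_out, e_spk)
print(cond.shape)
```

Repeating the speaker embedding across frames gives every time step access to the same speaker identity, while the decoder features carry the frame-specific content.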