RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling

Long-Khanh Pham; Thanh V. T. Tran; Minh-Tan Pham; Van Nguyen

TL;DR

RESOUND tackles lip-to-speech synthesis under realistic conditions by decoupling acoustic prosody from semantic content in a dual-path framework grounded in source-filter theory. The acoustic branch predicts speaker-specific prosody (pitch, energy, timbre) conditioned on a brief speaker prompt, while the semantic branch extracts linguistic content from silent video using AVHuBERT, L2T, and a cross-modal mapping with a Conformer-based attention mechanism. A Spec-Ling Decoder then fuses mel-spectrograms and discrete speech units to synthesize expressive, intelligible speech; the gain in content accuracy is evidenced by a lower word error rate (WER) and a higher ESTOI (extended short-time objective intelligibility) score. On LRS2-BBC and LRS3-TED, RESOUND achieves state-of-the-art performance across multiple objective and perceptual metrics, supporting the benefits of explicit prosody-semantics disentanglement and multimodal fusion for real-world lip-to-speech applications.
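
For readers unfamiliar with WER, the following is a minimal sketch of how the metric is commonly computed: the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the evaluation pipeline used by the authors is not described here and may normalize text or run a speech recognizer differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In a lip-to-speech evaluation, the hypothesis transcript is typically obtained by running a speech recognizer on the synthesized audio, so a lower value indicates that more of the spoken content was preserved.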

Abstract

Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard L2S benchmarks confirm the effectiveness of the proposed method across various metrics.
