SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis

Yifan Liang, Andong Li, Kang Yang, Guochen Yu, Fangkun Liu, Lingling Dai, Xiaodong Li, Chengshi Zheng

TL;DR

This paper tackles high-fidelity lip-to-speech synthesis by eliminating intermediate representations and directly predicting the continuous latent vectors of a neural audio codec from visual lip movements. It introduces SLD-L2S, a hierarchical subspace latent diffusion framework that maps visual features to codec latents, conditioned on speaker identity, using a backbone of diffusion convolution blocks (DiCBs) placed between subspace decomposition and recomposition modules. A reparameterized flow matching objective, combined with auxiliary semantic and SLM losses, enables stable training and improves perceptual and content fidelity. The model achieves state-of-the-art results on LRS3-TED and LRS2-BBC while remaining computationally efficient at inference. The work shows that latent diffusion is effective for L2S and that codec latents can be generated directly, without an intermediate representation.

Abstract

Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique to directly generate the target latent vectors. This enables a principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.
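
The reparameterization mentioned above can be illustrated with a small sketch. It assumes the common straight-line flow matching path x_t = (1 - t) x_0 + t x_1 between noise x_0 and a target latent x_1; the abstract does not state the exact path or parameterization used in SLD-L2S, so the function names and the `eps` clamp here are hypothetical. Under that assumption, a network that predicts the target latent directly still defines a velocity field, v = (x1_hat - x_t) / (1 - t), and because the prediction is itself a latent, auxiliary losses can be attached to it.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_from_target(x1_hat, xt, t, eps=1e-5):
    """Velocity implied by a predicted target: v = (x1_hat - x_t) / (1 - t).

    eps is an assumed clamp that avoids division by zero as t approaches 1.
    """
    scale = 1.0 / max(1.0 - t, eps)
    return [scale * (p - c) for p, c in zip(x1_hat, xt)]

def euler_sample(predict_x1, x0, steps=8):
    """Integrate dx/dt = v from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_from_target(predict_x1(x, t), x, t)
        x = [c + dt * vi for c, vi in zip(x, v)]
    return x

random.seed(0)
dim = 4
x1 = [random.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for a codec latent
x0 = [random.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian noise sample
t = 0.3
xt = interpolate(x0, x1, t)

# With a perfect target prediction, the implied velocity equals x1 - x0.
v_implied = velocity_from_target(x1, xt, t)
v_true = [b - a for a, b in zip(x0, x1)]

# An oracle predictor (always returns x1) recovers the target after sampling.
sample = euler_sample(lambda x, t: x1, x0)
```

In a real model, `predict_x1` would be the trained network conditioned on visual features and speaker identity; the oracle above only checks that the two parameterizations agree.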

Paper Structure

This paper contains 24 sections, 8 equations, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: The model utilizes a latent flow matching approach to directly map visual features to the continuous acoustic latent space. The architecture consists of three main stages: a subspace decomposition module processes visual features into parallel pathways; a backbone of DiCBs, conditioned with AdaLN-SOLA, effectively refines these representations; and a subspace recomposition module fuses them into a final latent vector. The model is trained with a multi-objective function, including a flow matching loss for acoustic reconstruction, a semantic loss on the latent space, and an SLM loss on the final synthesized waveform. During inference, the model generates the latent vectors for the codec to synthesize into speech.
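
The three-stage data flow in the caption (decompose, refine in parallel pathways, recompose) can be sketched in a few lines. This is a toy illustration only: the caption does not specify how the subspaces are formed or fused, so the contiguous channel split, the mean-based mixing that stands in for the DiCB backbone, and the concatenation used for recomposition are all assumptions, and every function name is hypothetical.

```python
def decompose(features, num_subspaces):
    """Split a feature vector into equal-width contiguous chunks (assumed scheme)."""
    width, rem = divmod(len(features), num_subspaces)
    if rem:
        raise ValueError("feature size must be divisible by num_subspaces")
    return [features[i * width:(i + 1) * width] for i in range(num_subspaces)]

def refine(subspaces, mix=0.5):
    """Toy stand-in for the backbone: blend each subspace with the cross-subspace mean."""
    k = len(subspaces)
    mean = [sum(col) / k for col in zip(*subspaces)]
    return [[(1 - mix) * v + mix * m for v, m in zip(s, mean)] for s in subspaces]

def recompose(subspaces):
    """Fuse the parallel pathways back into one vector by concatenation (assumed)."""
    return [v for s in subspaces for v in s]

visual = [float(i) for i in range(8)]        # stand-in for one frame of visual features
parts = decompose(visual, num_subspaces=4)   # four parallel pathways of width 2
latent = recompose(refine(parts))            # same size as the input in this toy setup
identity = recompose(decompose(visual, 4))   # decomposition followed by recomposition
```

The mixing step is there only to show information passing between subspaces as well as within them, which is the role the abstract assigns to the DiCB backbone.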