SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis

Yifan Liang; Andong Li; Kang Yang; Guochen Yu; Fangkun Liu; Lingling Dai; Xiaodong Li; Chengshi Zheng

TL;DR

This paper tackles high-fidelity lip-to-speech (L2S) synthesis by eliminating intermediate representations and directly predicting the continuous latent vectors of a neural audio codec from visual lip movements. It introduces SLD-L2S, a hierarchical subspace latent diffusion framework that maps visual features to codec latents, conditioned on speaker identity, using subspace decomposition and recomposition around a backbone of diffusion convolution blocks (DiCBs). A reparameterized flow matching objective, combined with auxiliary semantic and speech language model (SLM) losses, enables stable training and improves perceptual and content fidelity. The model achieves state-of-the-art results on LRS3-TED and LRS2-BBC while remaining computationally efficient at inference. The work demonstrates that latent diffusion is effective for L2S and offers a blueprint for high-fidelity audio-visual synthesis with direct latent generation.

Abstract

Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique to directly generate the target latent vectors. This enables a principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.
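The abstract does not spell out the reparameterization, so the following is only a minimal sketch of one common reading: assuming a straight-line path x_t = (1 - t) x_0 + t x_1 between noise x_0 and the target latent x_1, a network that predicts the target directly implies the velocity v = (x_1_hat - x_t) / (1 - t). Having an explicit x_1_hat is what would let losses such as the semantic and SLM terms be applied to the prediction. All function names, the `eps` clamp, and the Euler sampler are illustrative assumptions, not the paper's implementation.

```python
import random

def interpolate(x0, x1, t):
    """Point on the assumed straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_from_target(x_t, x1_hat, t, eps=1e-5):
    """Velocity implied by a predicted endpoint: v = (x1_hat - x_t) / (1 - t).

    The eps clamp (an assumption) avoids division by zero as t approaches 1.
    """
    scale = 1.0 / max(1.0 - t, eps)
    return [scale * (p - c) for p, c in zip(x1_hat, x_t)]

def target_prediction_loss(x1_hat, x1):
    """Mean squared error computed directly on the predicted latent."""
    return sum((p - q) ** 2 for p, q in zip(x1_hat, x1)) / len(x1)

def euler_sample(predict_x1, x0, steps=8):
    """Integrate dx/dt = v from t = 0 to t = 1 using the implied velocity."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_from_target(x, predict_x1(x, t), t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for one codec latent frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample

def oracle(x, t):
    """Hypothetical perfect predictor, used only to check the algebra."""
    return x1

sample = euler_sample(oracle, x0)
```

With the oracle predictor, the implied velocity at any t equals x_1 - x_0 and the sampler lands on x_1, which is a quick sanity check that the two parameterizations agree on this path.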

Paper Structure (24 sections, 8 equations, 1 figure, 5 tables)


Introduction
Related Work
Multi-Speaker Lip to Speech Synthesis
Neural Audio Codec
Flow Matching
Proposed Method
Overview
Visual Frontend
Hierarchical Subspace Latent Flowmatching
Subspace Decomposition
Diffusion Convolution Block
Subspace Recomposition
Training Objectives
Reparameterized Flow Matching Loss
Auxiliary Losses
...and 9 more sections

Figures (1)

Figure 1: The model utilizes a latent flow matching approach to directly map visual features to the continuous acoustic latent space. The architecture consists of three main stages: a subspace decomposition module processes visual features into parallel pathways; a backbone of DiCBs, conditioned with AdaLN-SOLA, effectively refines these representations; and a subspace recomposition module fuses them into a final latent vector. The model is trained with a multi-objective function, including a flow matching loss for acoustic reconstruction, a semantic loss on the latent space, and an SLM loss on the final synthesized waveform. During inference, the model generates the latent vectors for the codec to synthesize into speech.
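The data flow in the caption (decompose into parallel pathways, refine each, recompose into one vector) can be illustrated with a toy sketch. This assumes a plain contiguous channel split and concatenation, which the caption does not confirm; the actual modules are presumably learned, and `refine` here is a hypothetical placeholder for the DiCB backbone.

```python
def decompose(frame, num_subspaces):
    """Split one feature vector into equally sized contiguous subspaces."""
    if num_subspaces <= 0 or len(frame) % num_subspaces != 0:
        raise ValueError("feature size must be divisible by num_subspaces")
    width = len(frame) // num_subspaces
    return [frame[k * width:(k + 1) * width] for k in range(num_subspaces)]

def refine(subspace, gain):
    """Placeholder for a per-pathway block; here it is only a scalar gain."""
    return [gain * v for v in subspace]

def recompose(subspaces):
    """Fuse the parallel pathways back into a single vector by concatenation."""
    return [v for subspace in subspaces for v in subspace]

features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # stand-in for one visual feature frame
paths = decompose(features, 3)
refined = [refine(p, gain=1.0) for p in paths]
latent = recompose(refined)
```

With an identity `refine` (gain of 1.0), the round trip returns the input unchanged, which shows that in this simplified form the split and fusion steps lose no information on their own.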
