LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition

Bowen Hao, Dongliang Zhou, Xiaojie Li, Xingyu Zhang, Liang Xie, Jianlong Wu, Erwei Yin

TL;DR

Visual speech recognition (VSR) suffers from limited lip-movement diversity in existing datasets. LipGen addresses this in two ways: it generates a large, diverse set of synthetic lip videos with a speech-driven diffusion model, and it introduces a viseme-guided auxiliary task together with a temporal attention fusion module. The method reaches state-of-the-art accuracy on LRW (92.8%) and is more robust to pose variations (81.8% vs. 80.5% in the pose-augmented setting). The approach broadens data diversity, reduces sensitivity to non-lip cues, and offers practical gains for real-world VSR systems.

Abstract

Visual speech recognition (VSR), commonly known as lip reading, has garnered significant attention due to its wide-ranging practical applications. The advent of deep learning techniques and advancements in hardware capabilities have significantly enhanced the performance of lip reading models. Despite these advancements, existing datasets predominantly feature stable video recordings with limited variability in lip movements. This limitation results in models that are highly sensitive to variations encountered in real-world scenarios. To address this issue, we propose a novel framework, LipGen, which aims to improve model robustness by leveraging speech-driven synthetic visual data, thereby mitigating the constraints of current datasets. Additionally, we introduce an auxiliary task that incorporates viseme classification alongside attention mechanisms. This approach facilitates the efficient integration of temporal information, directing the model's focus toward the relevant segments of speech, thereby enhancing discriminative capabilities. Our method demonstrates superior performance compared to the current state-of-the-art on the Lip Reading in the Wild (LRW) dataset and exhibits even more pronounced advantages under challenging conditions.
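
The abstract describes two ingredients without giving formulas: attention over time, and an auxiliary viseme-classification objective. The following is a minimal sketch of how such pieces are commonly combined, not the authors' implementation. It assumes per-frame feature vectors, one scalar attention score per frame, per-frame viseme labels, and a hypothetical weighting factor `aux_weight`; none of these details are confirmed by the text above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_features, frame_scores):
    """Fuse per-frame features into one vector using softmax attention weights.

    frame_features: list of T feature vectors (lists of floats, same length)
    frame_scores:   list of T scalar relevance scores (assumed to come from
                    a small scoring network, which is not modeled here)
    """
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


def cross_entropy(logits, target):
    """Cross-entropy of a single example given raw logits and a class index."""
    return -math.log(softmax(logits)[target])


def joint_loss(word_logits, word_target,
               viseme_logits_per_frame, viseme_targets,
               aux_weight=0.5):
    """Word-classification loss plus a weighted auxiliary viseme loss.

    aux_weight is a hypothetical hyperparameter chosen for illustration.
    """
    word_loss = cross_entropy(word_logits, word_target)
    viseme_loss = sum(
        cross_entropy(logits, target)
        for logits, target in zip(viseme_logits_per_frame, viseme_targets)
    ) / len(viseme_targets)
    return word_loss + aux_weight * viseme_loss
```

In this sketch, frames with higher scores contribute more to the pooled representation, which is one plausible reading of "directing the model's focus toward the relevant segments of speech".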

Paper Structure

This paper contains 11 sections, 2 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Overview of the proposed lip reading model architecture. (a) The pipeline of the lip video data synthesis. (b) Training pipeline of LipGen.
  • Figure 2: Examples of diverse synthetic lip movement data generated by the lip animation model, illustrating a variety of conditions and speaker variations.
  • Figure 3: Phoneme-to-viseme mapping used for auxiliary labeling.
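
To illustrate what a phoneme-to-viseme mapping like the one in Figure 3 does, here is a small sketch. The table below is illustrative only and is not the paper's mapping: the viseme class names, the ARPAbet-style phoneme symbols, and the `to_visemes` helper are all assumptions. It relies only on the well-known fact that phonemes with the same visible articulation (for example the bilabials /p/, /b/, /m/) are hard to tell apart by sight.

```python
# Illustrative phoneme-to-viseme table (NOT the mapping from Figure 3).
# Keys are ARPAbet-style phoneme symbols; values are hypothetical class names.
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "TH": "dental", "DH": "dental",
    "T": "alveolar", "D": "alveolar",
    "AE": "open_vowel",
}


def to_visemes(phonemes, unknown="other"):
    """Map a phoneme sequence to viseme labels; unmapped symbols get `unknown`."""
    return [PHONEME_TO_VISEME.get(p, unknown) for p in phonemes]


# "bat", "mat", and "pat" differ in their first phoneme but share one viseme
# sequence under this table, so they look alike on the lips.
bat = to_visemes(["B", "AE", "T"])
mat = to_visemes(["M", "AE", "T"])
pat = to_visemes(["P", "AE", "T"])
```

Because several phonemes collapse into one viseme, viseme labels are coarser than phoneme labels, which is what makes them usable as auxiliary targets derived from the spoken content.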