Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

Linzhi Wu, Xingyu Zhang, Yakun Zhang, Changyan Zheng, Tiejun Liu, Liang Xie, Ye Yan, Erwei Yin

TL;DR

Cross-speaker lip reading faces significant inter-speaker variation that degrades generalization. The authors propose landmark-guided visual clues and max-min mutual information regularization within a hybrid CTC/attention framework, enabling speaker-insensitive representations. The method combines landmark-centered 3D patches, intra-frame relative positions, and inter-frame lip motions fed into a Conformer encoder with a Transformer decoder, trained in two stages and augmented by a speaker-identification branch. On GRID, the approach shows improved generalization for unseen and overlapped speakers, with potential for further gains when integrated with mouth-centered cues, supporting practical deployment in noisy or multi-speaker settings.
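The three landmark-guided cues named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the centroid-based definition of relative position, the first-order difference used for motion, and the zero-padding at image borders are all assumptions made for illustration. A per-frame patch is cropped around each landmark; stacking such patches over time would give a 3D patch.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) landmark coordinate
Frame = List[List[float]]     # grayscale image as rows of pixel values


def crop_patch(frame: Frame, center: Point, size: int) -> Frame:
    """Crop a size x size patch centered on a landmark (zero-padded at borders)."""
    h, w = len(frame), len(frame[0])
    top = int(round(center[1])) - size // 2
    left = int(round(center[0])) - size // 2
    return [
        [frame[r][c] if 0 <= r < h and 0 <= c < w else 0.0
         for c in range(left, left + size)]
        for r in range(top, top + size)
    ]


def relative_positions(landmarks: List[Point]) -> List[Point]:
    """Intra-frame cue (assumed form): offset of each landmark from the centroid."""
    cx = sum(x for x, _ in landmarks) / len(landmarks)
    cy = sum(y for _, y in landmarks) / len(landmarks)
    return [(x - cx, y - cy) for x, y in landmarks]


def lip_motion(prev: List[Point], curr: List[Point]) -> List[Point]:
    """Inter-frame cue (assumed form): per-landmark displacement between frames."""
    return [(xc - xp, yc - yp) for (xp, yp), (xc, yc) in zip(prev, curr)]
```

Because relative positions and displacements discard absolute image location, and small patches contain little of the surrounding face, features of this kind carry less speaker-specific appearance than a full mouth crop, which is the motivation stated in the summary.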

Abstract

Lip reading, the process of interpreting silent speech from visual lip movements, has attracted growing attention for its wide range of practical applications. Deep learning approaches have greatly improved lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, remains challenging due to inter-speaker variability: a well-trained lip reading system may perform poorly on a brand-new speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variation across speakers so that the model does not overfit to specific speakers. In this work, we address both the input visual clues and the latent representations within a hybrid CTC/attention architecture. We propose to use lip landmark-guided fine-grained visual clues, instead of the frequently used mouth-cropped images, as input features, which diminishes speaker-specific appearance characteristics. Furthermore, we propose a max-min mutual information regularization approach to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under both intra-speaker and inter-speaker conditions.
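
For readers unfamiliar with the term, a hybrid CTC/attention architecture is commonly trained with a weighted sum of a CTC loss on the encoder outputs and a cross-entropy loss on the attention decoder outputs. The form below is the standard formulation, given here as background; the weight $\lambda$ is an assumed symbol, and the abstract does not state the value used or how the mutual information regularizer is added to this objective.

$$
\mathcal{L}_{\text{hybrid}} = \lambda \, \mathcal{L}_{\text{CTC}} + (1 - \lambda) \, \mathcal{L}_{\text{att}}, \qquad 0 \le \lambda \le 1
$$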

Paper Structure

This paper contains 30 sections, 14 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Illustration of the overall multi-task learning framework for cross-speaker lip reading. The model inputs are derived from the mouth-centered crops coupled with lip landmarks.
  • Figure 2: Performance comparison of different patch sizes (ranging from 20 to 32) in the overlapped and unseen speaker settings. The dashed line indicates the recognition performance using a flexible patch size.
  • Figure 3: Attention weight maps between lip landmarks (indices from 49 to 68) from the attentive intra-frame fusion module. The weights are calculated by averaging over all the self-attention heads. The video clip used here is drawn from the test set. Darker colors indicate larger weight values.