Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading

Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen

TL;DR

This work tackles robustness in visual speech recognition for unseen speakers by separating speaker-specific characteristics from speech content across network layers. It introduces a four-module architecture: a speaker verification branch that yields $\theta$, a feature enhancement module for shallow layers, and a feature suppression module for deep layers, which together modulate the lip-reading backbone. Training combines the losses $L^{ID}_{triple}$, $L^{Enh}_{triple}$, $L^{Suppress}_{triple}$, and $L^{VSR}_{CE}$ or $L^{VSR}_{CTC}$, enabling both unsupervised speaker adaptation and content-driven lip reading. The method achieves consistent gains over baselines on LRW-ID and GRID and performs well in the extreme CAS-VSR-S68 setting; CAS-VSR-S68 is publicly released to support future research.

Abstract

In this paper, we propose a novel method for speaker adaptation in lip reading, motivated by two observations. First, a speaker's own characteristics can be portrayed well by shallow networks from a few of his/her facial images, or even a single image, while the fine-grained dynamic features associated with the speech content expressed by the talking face need deep sequential networks to be represented accurately. Therefore, we treat the shallow and deep layers differently for speaker-adaptive lip reading. Second, we observe that a speaker's unique characteristics (e.g., a prominent oral cavity and mandible) have varied effects on lip-reading performance for different words and pronunciations, necessitating adaptive enhancement or suppression of the features for robust lip reading. Based on these two observations, we propose to take advantage of the speaker's own characteristics to automatically learn separable hidden unit contributions with different targets for shallow layers and deep layers respectively. For shallow layers, where features related to the speaker's characteristics are stronger than the features related to speech content, we introduce speaker-adaptive features that learn to enhance the speech content features. For deep layers, where both the speaker's features and the speech content features are expressed well, we introduce speaker-adaptive features that learn to suppress noise irrelevant to the speech content for robust lip reading. Our approach consistently outperforms existing methods, as confirmed by comprehensive analysis and comparison across different settings. Besides the evaluation on the popular LRW-ID and GRID datasets, we also release a new dataset for evaluation, CAS-VSR-S68h, to further assess the performance in an extreme setting where just a few speakers are available but the speech content covers a large and diversified range.
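The idea of separate hidden unit contributions for shallow and deep layers can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear projection of the speaker embedding, and the choice of a `2 * sigmoid` scale for enhancement (a parameterization commonly used in LHUC-style adaptation) versus a plain `sigmoid` gate for suppression are all assumptions made for illustration, as are the toy feature values.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(vec, weights):
    """Project `vec` with a row-major weight matrix (one row per output unit)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def enhance(features, speaker_vec, weights):
    """Shallow-layer modulation (assumed form): per-unit scale in (0, 2),
    so a unit can be amplified as well as attenuated."""
    scales = [2.0 * sigmoid(z) for z in linear(speaker_vec, weights)]
    return [s * f for s, f in zip(scales, features)]

def suppress(features, speaker_vec, weights):
    """Deep-layer modulation (assumed form): per-unit gate in (0, 1),
    so a unit can only be attenuated."""
    gates = [sigmoid(z) for z in linear(speaker_vec, weights)]
    return [g * f for g, f in zip(gates, features)]

def random_matrix(rows, cols, rng):
    """Stand-in for learned projection weights (hypothetical)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
theta = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # hypothetical speaker embedding
shallow = [0.5, -1.0, 2.0]                          # toy shallow-layer features
deep = [1.5, 0.2, -0.7]                             # toy deep-layer features

shallow_out = enhance(shallow, theta, random_matrix(3, 4, rng))
deep_out = suppress(deep, theta, random_matrix(3, 4, rng))
```

In this sketch both modules are driven by the same speaker vector but differ in the range of their weights, which mirrors the different targets described for shallow and deep layers.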

Paper Structure (12 sections, 5 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Accuracy of Lip Reading and Identification Using the Output at Different Layers
  • Figure 2: The Overall Architecture of Our Proposed Method
  • Figure 3: Visualization of the Generated Enhancement Weights
  • Figure 4: Adaptation Results Using Different Amounts of Adaptation Data on LRW-ID