Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro

TL;DR

The paper tackles lip reading for low-resource languages by splitting what must be learned into two parts: general speech knowledge, learned from a high-resource language through masked prediction of speech units, and language-specific knowledge, learned from audio-text data via a Language-specific Memory-augmented Decoder (LMDecoder). The approach combines the visually encoded general speech representations with memory-guided language-specific features through attention, which enables lip reading even when video-text data is scarce. The reported results show state-of-the-art performance on English (LRS2) and improvements on four low-resource languages (Spanish, French, Italian, and Portuguese), with ablations supporting the benefit of the Language-specific Memory (LM) and of additional audio-text data. The framework offers a practical path to lip reading for more languages by transferring speech knowledge across languages and drawing on language knowledge learned from audio-text data, without requiring large video-text corpora.

Abstract

This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. It is known that different languages partially share common phonemes, thus general speech knowledge learned from one language can be extended to other languages. Then, we try to learn language-specific knowledge, the ability to model language, by proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data which is more easily accessible than video-text paired data. Therefore, with LMDecoder, we can transform the input speech units into language-specific audio features and translate them into texts by utilizing the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. Through extensive experiments using five languages, English, Spanish, French, Italian, and Portuguese, the effectiveness of the proposed method is evaluated.
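
The abstract does not say how the speech units are obtained. One common way to turn continuous features into discrete units is to assign each frame to its nearest centroid in a codebook learned by clustering. The following is a minimal sketch under that assumption. The function name `quantize_to_units`, the 2-D features, and the codebook values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from math import dist


def quantize_to_units(frames, codebook):
    """Map each feature frame to the index of its nearest codebook centroid."""
    units = []
    for frame in frames:
        # Euclidean distance to every centroid; keep the index of the closest.
        nearest = min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        units.append(nearest)
    return units


# Hypothetical codebook with three centroids (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
# Hypothetical per-frame features.
frames = [(0.1, -0.1), (0.9, 1.2), (-0.8, 1.9), (0.2, 0.1)]

units = quantize_to_units(frames, codebook)
print(units)
```

Each output index plays the role of a speech unit: a discrete label that a visual encoder could be trained to predict and that a decoder could take as input.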

Paper Structure (28 sections, 4 equations, 2 figures, 17 tables)

Figures (2)

  • Figure 1: Overview of the proposed method for low-resource language lip reading. (a) Learning general speech representation by using masked prediction of speech units in a high-resource language. (b) The proposed Language-specific Memory-augmented Decoder (LMDecoder) learns language-specific knowledge from audio-text paired data by quantizing the input into speech units. (c) Lip reading models for low-resource languages can be built by combining general speech knowledge and language-specific knowledge.
  • Figure 2: Illustration of Language-specific Memory (LM) and the memory banks $B$ of LM. When a quantized speech unit is given, LM transforms it into a language-specific audio feature by reading the memory value. Therefore, the mapping of speech units to language-specific audio features can be constructed.
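
The mapping in Figure 2, from a quantized speech unit to a language-specific audio feature, can be pictured as a table read. The sketch below assumes one stored value vector per speech unit and a direct index lookup; the paper's actual memory addressing may differ. The class name `LanguageSpecificMemory`, the sizes, and the random initial values are hypothetical stand-ins for parameters that would be learned from audio-text data.

```python
import random


class LanguageSpecificMemory:
    """Toy memory: one value vector per speech unit (an assumed layout)."""

    def __init__(self, num_units, feature_dim, seed=0):
        rng = random.Random(seed)
        # In a trained model these values would be learned; random numbers
        # stand in for them here.
        self.banks = [
            [rng.uniform(-1.0, 1.0) for _ in range(feature_dim)]
            for _ in range(num_units)
        ]

    def read(self, units):
        """Return the stored feature vector for each speech unit index."""
        return [self.banks[u] for u in units]


memory = LanguageSpecificMemory(num_units=3, feature_dim=4)
features = memory.read([0, 1, 2, 0])
print(len(features), len(features[0]))
```

Because the read depends only on the unit index, the same unit always yields the same feature. Under this assumption, a memory trained on audio-derived units could be reused unchanged when the units are instead predicted from lip video.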