Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

TL;DR

The paper introduces VSP-LLM, a unified framework that maps visual speech into the input latent space of a Large Language Model (LLM) to perform both visual speech recognition (VSR) and visual speech translation (VST). It uses a self-supervised visual encoder (AV-HuBERT), deduplicates frame features with visual speech units to shorten the input sequence, and combines instruction-driven multi-task learning with QLoRA for data-efficient training. Key contributions include the first unified VSR+VST model, a novel deduplication method that significantly lowers computation, and state-of-the-art results on LRS3 and MuAViC with as little as 30 hours of labeled data. The approach demonstrates the practical potential of integrating rich language-based context into visual speech processing: it helps disambiguate homophenes and improves translation across languages when labeled data is limited.

Abstract

In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by drawing on the power of LLMs. Specifically, VSP-LLM is designed to perform multiple tasks, visual speech recognition and translation, where the given instruction controls the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Because input frames contain redundant information, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data translates lip movements more effectively than a recent model trained with 433 hours of data.
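The deduplication step described above can be illustrated with a minimal sketch. It assumes each frame already has a feature vector and an integer visual speech unit ID; how the units are produced is outside this sketch. The function name `deduplicate` and the toy shapes are hypothetical and are not taken from the authors' code. Consecutive frames that share a unit are averaged into a single feature, so the sequence passed to the LLM is shorter.

```python
import numpy as np


def deduplicate(features: np.ndarray, units: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Average consecutive frame features that share the same unit ID.

    features: (T, D) array of per-frame visual features.
    units:    (T,) array of integer unit IDs, one per frame (assumed given).
    Returns the reduced (T', D) features and the (T',) unit ID of each run.
    """
    if features.shape[0] != units.shape[0]:
        raise ValueError("features and units must have the same number of frames")
    if units.size == 0:
        return features[:0], units[:0]

    # A new run starts at frame 0 and wherever the unit ID changes.
    change = np.flatnonzero(units[1:] != units[:-1]) + 1
    starts = np.concatenate(([0], change))

    # Sum the features inside each run, then divide by the run length.
    sums = np.add.reduceat(features, starts, axis=0)
    lengths = np.diff(np.concatenate((starts, [units.shape[0]])))
    return sums / lengths[:, None], units[starts]


# Toy example: 6 frames with 2-dimensional features and 3 runs of units.
feats = np.arange(12, dtype=float).reshape(6, 2)
unit_ids = np.array([7, 7, 7, 2, 2, 7])
reduced, kept = deduplicate(feats, unit_ids)
print(reduced.shape)  # (3, 2): six frames collapse to three features
```

Only adjacent repeats are merged in this sketch: the unit 7 that reappears at the end of the toy sequence stays a separate entry, which preserves temporal order.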

Paper Structure (25 sections, 1 equation, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Illustration of our VSP-LLM framework. Visual speech representations produced by the visual encoder are mapped to visual speech units. The representations are then reduced by averaging over frames that map to the same visual speech unit. These reduced representations are fed into the LLM along with text instructions.
  • Figure 2: Textualization results of the visual speech representations. GT, (a), and (b) indicate the ground truth, textualization without deduplication, and textualization with deduplication, respectively.
  • Figure 3: Qualitative results showing that the contextual modeling ability of the LLM adopted in our method can alleviate the homophene problem and other confusing cases. The red and blue words indicate the wrong predictions from AV-HuBERT. As the examples show, the proposed method can generate the correct words by considering the entire context (e.g., 'i' to 'eye').
  • Figure 4: VSR performance analysis on LRS3 for test samples of varying video length. Owing to the contextual understanding ability of the LLM, the proposed method shows superior performance on longer videos.
  • Figure 5: Visualization results showing how video frame features are deduplicated and mapped into visual speech units. By doing so, the redundant frame features can be reduced efficiently.
  • ...and 2 more figures