Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition

Minsu Kim, Hyung-Il Kim, Yong Man Ro

TL;DR

Prompt tuning methods of Deep Neural Networks are proposed for speaker-adaptive VSR, and it is shown that the performance of a pre-trained VSR model on unseen speakers can be largely improved by using a small amount of adaptation data, even if the pre-trained model is already developed with large speaker variations.

Abstract

Visual Speech Recognition (VSR) aims to transcribe speech into text from lip movements alone. Because it relies on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements, so VSR models show degraded performance when they are applied to unseen speakers. In this paper, to remedy the performance degradation of the VSR model on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. Unlike previous prompt tuning methods, which are mainly limited to Transformer-variant architectures, we explore three types of prompts (the addition, the padding, and the concatenation form) that can be applied to a VSR model, which is generally composed of a CNN and a Transformer. With the proposed prompt tuning, we show that the performance of the pre-trained VSR model on unseen speakers can be largely improved by using a small amount of adaptation data (e.g., less than 5 minutes), even if the pre-trained model is already developed with large speaker variations. Moreover, by analyzing the performance and parameters of the different types of prompts, we investigate when prompt tuning is preferred over finetuning methods. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.

Paper Structure (23 sections, 8 equations, 4 figures, 12 tables)

This paper contains 23 sections, 8 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Illustration of the proposed prompt tuning for speaker-adaptive VSR. (a) The general architecture of VSR models. (b) Three different types of prompts that can be applied to VSR models: i) addition form, ii) padding form, and iii) concatenation form. They can be jointly utilized to adapt the pre-trained VSR model on the unseen target speaker. We only update the prompts while the pre-trained VSR model is kept frozen.
  • Figure 2: Detailed illustrations of the different prompt types. (a) The addition form prompt is added to the input video frames. (b) The padding form prompt replaces the original padding region in the CNN. (c) The concatenation form prompt is concatenated to the input of the Transformer-based module along the temporal dimension. Only the prompts (green in the figure) are tuned during adaptation, while the weight parameters of the pre-trained model are kept unchanged.
  • Figure 3: WER comparisons between proposed prompt tuning and different finetuning methods according to adaptation data ratio on GRID.
  • Figure 4: ACC comparisons between proposed prompt tuning and different finetuning methods according to adaptation data ratio on LRW-ID.
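
The three prompt forms described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the tensor shapes, the choice to share one addition prompt across all frames, and the choice to place concatenation prompts before the sequence (rather than after it) are all assumptions made for illustration. In an actual adaptation setup only the prompt arrays would be trained while the pre-trained model weights stay frozen.

```python
import numpy as np

# Hypothetical helpers; names and shapes are illustrative assumptions,
# not taken from the paper's code.

def add_prompt(frames, prompt):
    """Addition form: add a prompt to the input video frames.

    frames: (T, H, W) grayscale frames; prompt: (H, W), broadcast over time
    (sharing one prompt across frames is an assumption).
    """
    return frames + prompt


def pad_with_prompt(feature, pad_prompt, pad=1):
    """Padding form: fill the region a CNN layer would normally zero-pad.

    feature: (C, H, W); pad_prompt: (C, H + 2*pad, W + 2*pad).
    Only the border of pad_prompt is used; its interior is overwritten
    by the feature map.
    """
    out = pad_prompt.copy()
    out[:, pad:-pad, pad:-pad] = feature
    return out


def concat_prompt(tokens, prompt):
    """Concatenation form: join prompt tokens along the temporal dimension.

    tokens: (T, D) Transformer input; prompt: (P, D).
    Prepending (rather than appending) is an assumption.
    """
    return np.concatenate([prompt, tokens], axis=0)


frames = np.zeros((5, 4, 4))
added = add_prompt(frames, np.ones((4, 4)))

feature = np.ones((2, 3, 3))
padded = pad_with_prompt(feature, np.full((2, 5, 5), 7.0))

tokens = np.zeros((5, 8))
extended = concat_prompt(tokens, np.ones((2, 8)))

print(added.shape, padded.shape, extended.shape)
```

Note that the concatenation form lengthens the sequence seen by the Transformer (here from 5 to 7 steps), whereas the addition and padding forms leave the shape expected by the next layer unchanged relative to ordinary zero padding.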